Introduction to Statistics with R
This textbook introduces statistics using R, a free and popular statistical software environment. It is compiled from sections of other LibreTexts books and covers a range of statistical methods together with their implementation in R.
INTRODUCTION TO STATISTICS WITH R Edward Chi Cerritos College Note to Students and Instructors April 27, 2023 Dear Students and Instructors, This textbook is an initial attempt at creating a free introduction to statistics textbook that incorporates the free and popular statistical software, R. This book is a collection of sections pulled from other free textbooks published on LibreTexts. Because some sections are from one author's book and other sections are from another author's book, there are some inconsistencies. I apologize in advance for the confusion this will cause and thank you for your understanding. Sincerely, W. Edward Chi 1 https://stats.libretexts.org/@go/page/36103 Introduction to Statistics with R This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all, pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully consult the applicable license(s) before pursuing such effects. Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their students. Unlike traditional textbooks, LibreTexts’ web based origins allow powerful integration of advanced features and new technologies to support learning. The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open- access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are organized within a central environment that is both vertically (from advance to basic level) and horizontally (across different fields) integrated. The LibreTexts libraries are Powered by NICE CXOne and are supported by the Department of Education Open Textbook Pilot Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant No. 1246120, 1525057, and 1413739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation nor the US Department of Education. Have questions or comments? For information about adoptions or adaptions contact [email protected]. More information on our activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog (http://Blog.Libretexts.org). This text was compiled on 12/19/2023 TABLE OF CONTENTS Note to Students and Instructors Licensing 1: Basics 1.1: Introduction 1.1.1: What Is Statistical Thinking? 1.1.2: Dealing with Statistics Anxiety 1.1.3: What Can Statistics Do for Us? 
1.1.4: The Big Ideas of Statistics 1.1.5: Causality and Statistics 1.2: Working with Data 1.2.1: What Are Data? 1.2.2: Data Basics 1.2.3: Scales of Measurement 1.2.4: What Makes a Good Measurement? 1.2.5: Overview of Data Collection Principles 1.2.6: Observational Studies and Sampling Strategies 1.2.7: Experiments 1.2.8: How Not to Do Statistics 1.2.9: Exercises 2: Introduction to R 2.1: Why Programming Is Hard to Learn 2.2: Using RStudio 2.3: Installing R 2.4: Getting Started with R 2.5: Variables 2.6: Functions 2.7: Letting RStudio Help You with Your Commands 2.8: Vectors 2.9: Math with Vectors 2.10: Data Frames 2.11: Using R Libraries 2.12: Installing and Loading Packages 2.13: Using Comments 2.14: Navigating the File System 2.15: Loading and Saving Data 2.16: Useful Things to Know about Variables 2.17: Factors 2.18: Data frames 2.19: Suggested Readings and Videos 3: Summarizing Data Visually 3.1: Qualitative Data 3.2: Quantitative Data 1 https://stats.libretexts.org/@go/page/36323 3.3: Other Graphical Representations of Data 3.4: Statistical Literacy 4: Summarizing Data Visually Using R 4.1: An Overview of R Graphics 4.2: An Introduction to Plotting 4.3: Histograms 4.4: Stem and Leaf Plots 4.5: Scatterplots 4.6: Bar Graphs 4.7: Saving Image Files Using R and Rstudio 4.8: Summary 5: Summarizing Data With Numbers 5.1: Central Tendency 5.2: What is Central Tendency 5.3: Measures of Central Tendency 5.4: Median and Mean 5.5: Measures of the Location of the Data 5.6: Additional Measures 5.7: Comparing Measures 5.8: Variability 5.9: Measures of Variability 5.10: Shapes of Distributions 5.11: Effects of Linear Transformations 5.12: Variance Sum Law I - Uncorrelated Variables 5.13: Statistical Literacy 5.14: Case Study- Using Stents to Prevent Strokes 5.15: Measures of the Location of the Data (Exercises) 5.E: Summarizing Distributions (Exercises) 6: Describing Data With Numbers Using R 6.1: Measures of Central Tendency 6.2: Measures of Variability 6.3: Skew and Kurtosis 6.4: Getting an Overall Summary of a Variable 6.5: Descriptive Statistics Separately for each Group 6.6: Standard Scores 6.7: Epilogue- Good Descriptive Statistics Are Descriptive! 7: Introduction to Probability 7.1: How are Probability and Statistics Different? 7.2: What Does Probability Mean? 
7.3: Basic Probability Theory 7.4: The Binomial Distribution 7.5: The Normal Distribution 7.6: Other Useful Distributions 7.7: Summary 7.8: Statistical Literacy 7.E: Probability (Exercises) 2 https://stats.libretexts.org/@go/page/36323 8: Estimating Unknown Quantities from a Sample 8.1: Samples, Populations and Sampling 8.2: The Law of Large Numbers 8.3: Sampling Distributions and the Central Limit Theorem 8.4: Estimating Population Parameters 8.5: Estimating a Confidence Interval 8.6: Summary 8.7: Statistical Literacy 8.E: Estimation (Exercises) 9: Hypothesis Testing 9.1: A Menagerie of Hypotheses 9.2: Two Types of Errors 9.3: Test Statistics and Sampling Distributions 9.4: Making Decisions 9.5: The p value of a test 9.6: Reporting the Results of a Hypothesis Test 9.7: Running the Hypothesis Test in Practice 9.8: Effect Size, Sample Size and Power 9.9: Some Issues to Consider 9.10: Misconceptions of Hypothesis Testing 9.11: Summary 9.12: Statistical Literacy 9.13: Logic of Hypothesis Testing (Exercises) 10: Categorical Data Analysis 10.1: The χ2 Goodness-of-fit Test 10.2: The χ2 test of independence (or association) 10.3: The Continuity Correction 10.4: Effect Size 10.5: Assumptions of the Test(s) 10.6: The Most Typical Way to Do Chi-square Tests in R 10.7: The Fisher Exact Test 10.8: The McNemar Test 10.9: What’s the Difference Between McNemar and Independence? 10.10: Summary 10.11: Statistical Literacy 10.12: Chi Square (Exercises) 11: Comparing Two Means 11.1: The one-sample z-test 11.2: The One-sample t-test 11.3: The Independent Samples t-test (Student Test) 11.4: The Independent Samples t-test (Welch Test) 11.5: The Paired-samples t-test 11.6: One Sided Tests 11.7: Using the t.test() Function 11.8: Effect Size 11.9: Checking the Normality of a Sample 11.10: Testing Non-normal Data with Wilcoxon Tests 3 https://stats.libretexts.org/@go/page/36323 11.11: Summary 11.12: Statistical Literacy 11.E: Tests of Means (Exercises) 12: Comparing Several Means (One-way ANOVA) 12.1: Summary 12.2: An Illustrative Data Set 12.3: How ANOVA Works 12.4: Running an ANOVA in R 12.5: Effect Size 12.6: Multiple Comparisons and Post Hoc Tests 12.7: Assumptions of One-way ANOVA 12.8: Checking the Homogeneity of Variance Assumption 12.9: Removing the Homogeneity of Variance Assumption 12.10: Checking the Normality Assumption 12.11: Removing the Normality Assumption 12.12: On the Relationship Between ANOVA and the Student t Test 13: Introduction to Linear Regression 13.1: Prelude to Linear Regression 13.2: Line Fitting, Residuals, and Correlation 13.3: Fitting a Line by Least Squares Regression 13.4: Types of Outliers in Linear Regression 13.5: Inference for Linear Regression 13.6: Exercises 14: Multiple and Logistic Regression 14.1: Introduction to Multiple Regression 14.2: Model Selection 14.3: Checking Model Assumptions using Graphs 14.4: Introduction to Logistic Regression 14.5: Exercises 14.6: Statistical Literacy 14.E: Regression (Exercises) 15: Regression in R 15.1: What Is a Linear Regression Model? 
15.2: Estimating a Linear Regression Model 15.3: Multiple Linear Regression 15.4: Quantifying the Fit of the Regression Model 15.5: Hypothesis Tests for Regression Models 15.6: Correlations 15.7: Handling Missing Values 15.8: Testing the Significance of a Correlation 15.9: Regarding Regression Coefficients 15.10: Assumptions of Regression 15.11: Model Checking 15.12: Model Selection 15.13: Summary 4 https://stats.libretexts.org/@go/page/36323 16: Research Design 16.1: Scientific Method 16.2: Measurement 16.3: Data Collection 16.4: Sampling Bias 16.5: Experimental Designs 16.6: Causation 16.7: Statistical Literacy 16.E: Research Design (Exercises) 17: Preparing Datasets and Other Pragmatic Matters 17.1: Tabulating and Cross-tabulating Data 17.2: Transforming and Recoding a Variable 17.3: A few More Mathematical Functions and Operations 17.4: Extracting a Subset of a Vector 17.5: Extracting a Subset of a Data Frame 17.6: Sorting, Flipping and Merging Data 17.7: Reshaping a Data Frame 17.8: Working with Text 17.9: Reading Unusual Data Files 17.10: Coercing Data from One Class to Another 17.11: Other Useful Data Structures 17.12: Miscellaneous Topics 17.13: Summary 18: Basic Programming 18.1: Scripts 18.2: Loops 18.3: Conditional Statements 18.4: Writing Functions 18.5: Implicit Loops 18.6: Summary 19: Bayesian Statistics 19.1: Probabilistic Reasoning by Rational Agents 19.2: Bayesian Hypothesis Tests 19.3: Why Be a Bayesian? 19.4: Evidentiary Standards You Can Believe 19.5: The p-value Is a Lie. 19.6: Bayesian Analysis of Contingency Tables 19.7: Bayesian t-tests 19.8: Bayesian Regression 19.9: Bayesian ANOVA 19.10: Summary 20: Case Studies and Data 20.1: Angry Moods 20.2: Flatulence 20.3: Physicians Reactions 20.4: Teacher Ratings 5 https://stats.libretexts.org/@go/page/36323 20.5: Diet and Health 20.6: Smiles and Leniency 20.7: Animal Research 20.8: ADHD Treatment 20.9: Weapons and Aggression 20.10: SAT and College GPA 20.11: Stereograms 20.12: Driving 20.13: Stroop Interference 20.14: TV Violence 20.15: Obesity and Bias 20.16: Shaking and Stirring Martinis 20.17: Adolescent Lifestyle Choices 20.18: Chocolate and Body Weight 20.19: Bedroom TV and Hispanic Children 20.20: Weight and Sleep Apnea 20.21: Misusing SEM 20.22: School Gardens and Vegetable Consumption 20.23: TV and Hypertension 20.24: Dietary Supplements 20.25: Young People and Binge Drinking 20.26: Sugar Consumption in the US Diet 20.27: Nutrition Information Sources and Older Adults 20.28: Mind Set - Exercise and the Placebo Effect 20.29: Predicting Present and Future Affect 20.30: Exercise and Memory 20.31: Parental Recognition of Child Obesity 20.32: Educational Attainment and Racial, Ethnic, and Gender Disparity 21: Math Review for Introductory Statistics 00: Front Matter TitlePage InfoPage Table of Contents Licensing 21.1: Decimals Fractions and Percents 21.1.1: Comparing Fractions, Decimals, and Percents 21.1.2: Converting Between Fractions, Decimals and Percents 21.1.3: Decimals- Rounding and Scientific Notation 21.1.4: Using Fractions, Decimals and Percents to Describe Charts 21.2: The Number Line 21.2.1: Distance between Two Points on a Number Line 21.2.2: Plotting Points and Intervals on the Number Line 21.2.3: Represent an Inequality as an Interval on a Number Line 21.2.4: The Midpoint 21.3: Operations on Numbers 21.3.1: Area of a Rectangle 21.3.2: Factorials and Combination Notation 21.3.3: Order of Operations 21.3.4: Order of Operations in Expressions and Formulas 6 https://stats.libretexts.org/@go/page/36323 21.3.5: 
Perform Signed Number Arithmetic 21.3.6: Powers and Roots 21.3.7: Using Summation Notation 21.4: Sets 21.4.1: Set Notation 21.4.2: The Complement of a Set 21.4.3: The Union and Intersection of Two Sets 21.4.4: Venn Diagrams 21.5: Expressions, Equations and Inequalities 21.5.1: Evaluate Algebraic Expressions 21.5.2: Inequalities and Midpoints 21.5.3: Solve Equations with Roots 21.5.4: Solving Linear Equations in One Variable 21.6: Graphing Points and Lines in Two Dimensions 21.6.1: Finding Residuals 21.6.2: Find the Equation of a Line given its Graph 21.6.3: Find y given x and the Equation of a Line 21.6.4: Graph a Line given its Equation 21.6.5: Interpreting the Slope of a Line 21.6.6: Interpreting the y-intercept of a Line 21.6.7: Plot an Ordered Pair Index Glossary Detailed Licensing Index Glossary Detailed Licensing Detailed Licensing 7 https://stats.libretexts.org/@go/page/36323 Licensing A detailed breakdown of this resource's licensing can be found in Back Matter/Detailed Licensing. 1 https://stats.libretexts.org/@go/page/36324 CHAPTER OVERVIEW 1: Basics 1.1: Introduction 1.1.1: What Is Statistical Thinking? 1.1.2: Dealing with Statistics Anxiety 1.1.3: What Can Statistics Do for Us? 1.1.4: The Big Ideas of Statistics 1.1.5: Causality and Statistics 1.2: Working with Data 1.2.1: What Are Data? 1.2.2: Data Basics 1.2.3: Scales of Measurement 1.2.4: What Makes a Good Measurement? 1.2.5: Overview of Data Collection Principles 1.2.6: Observational Studies and Sampling Strategies 1.2.7: Experiments 1.2.8: How Not to Do Statistics 1.2.9: Exercises 1: Basics is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts. 1 SECTION OVERVIEW 1.1: Introduction Learning Objectives Having read this chapter, you should be able to: Describe the central goals and fundamental concepts of statistics Describe the difference between experimental and observational research with regard to what can be inferred about causality Explain how randomization provides the ability to make inferences about causation. “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” - H.G. Wells 1.1.1: What Is Statistical Thinking? 1.1.2: Dealing with Statistics Anxiety 1.1.3: What Can Statistics Do for Us? 1.1.4: The Big Ideas of Statistics 1.1.5: Causality and Statistics This page titled 1.1: Introduction is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.1.1 https://stats.libretexts.org/@go/page/35593 1.1.1: What Is Statistical Thinking? Statistical thinking is a way of understanding a complex world by describing it in relatively simple terms that nonetheless capture essential aspects of its structure, and that also provide us some idea of how uncertain we are about our knowledge. The foundations of statistical thinking come primarily from mathematics and statistics, but also from computer science, psychology, and other fields of study. We can distinguish statistical thinking from other forms of thinking that are less likely to describe the world accurately. In particular, human intuition often tries to answer the same questions that we can answer using statistical thinking, but often gets the answer wrong. 
For example, in recent years most Americans have reported that they think that violent crime was worse compared to the previous year (Pew Research Center). However, a statistical analysis of the actual crime data shows that in fact violent crime has steadily decreased since the 1990s. Intuition fails us because we rely upon best guesses (which psychologists refer to as heuristics) that can often get it wrong. For example, humans often judge the prevalence of some event (like violent crime) using an availability heuristic – that is, how easily we can think of an example of violent crime. For this reason, our judgments of increasing crime rates may be more reflective of increasing news coverage, in spite of an actual decrease in the rate of crime. Statistical thinking provides us with the tools to more accurately understand the world and overcome the fallibility of human intuition.
This page titled 1.1.1: What Is Statistical Thinking? is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.1: What Is Statistical Thinking? by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.1.1.1 https://stats.libretexts.org/@go/page/35594
1.1.2: Dealing with Statistics Anxiety
Many people come to their first statistics class with a lot of trepidation and anxiety, especially once they hear that they will also have to learn to code in order to analyze data. In my class I give students a survey prior to the first session in order to measure their attitude towards statistics, asking them to rate a number of statements on a scale of 1 (strongly disagree) to 7 (strongly agree). One of the items on the survey is "The thought of being enrolled in a statistics course makes me nervous". In the most recent class, almost two-thirds of the class responded with a five or higher, and about one-fourth of the students said that they strongly agreed with the statement. So if you feel nervous about starting to learn statistics, you are not alone. Anxiety feels uncomfortable, but psychology tells us that this kind of emotional arousal can actually help us perform better on many tasks by focusing our attention. So if you start to feel anxious about the material in this course, remind yourself that many others in the class are feeling similarly, and that the arousal could actually help you perform better (even if it doesn't seem like it!).
This page titled 1.1.2: Dealing with Statistics Anxiety is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.2: Dealing with Statistics Anxiety by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.1.2.1 https://stats.libretexts.org/@go/page/35595
1.1.3: What Can Statistics Do for Us?
There are three major things that we can do with statistics:
Describe: The world is complex and we often need to describe it in a simplified way that we can understand.
Decide: We often need to make decisions based on data, usually in the face of uncertainty.
Predict: We often wish to make predictions about new situations based on our knowledge of previous situations.
Let's look at an example of these in action, centered on a question that many of us are interested in: How do we decide what's healthy to eat? There are many different sources of guidance, from government dietary guidelines to diet books to bloggers. Let's focus in on a specific question: Is saturated fat in our diet a bad thing?
One way that we might answer this question is common sense. If we eat fat then it's going to turn straight into fat in our bodies, right? And we have all seen photos of arteries clogged with fat, so eating fat is going to clog our arteries, right?
Another way that we might answer this question is by listening to authority figures. The Dietary Guidelines from the US Food and Drug Administration have as one of their Key Recommendations that "A healthy eating pattern limits saturated fats". You might hope that these guidelines would be based on good science, and in some cases they are, but as Nina Teicholz outlined in her book "Big Fat Surprise" (Teicholz 2014), this particular recommendation seems to be based more on the dogma of nutrition researchers than on actual evidence.
Finally, we might look at actual scientific research. Let's start by looking at a large study called the PURE study, which has examined diets and health outcomes (including death) in more than 135,000 people from 18 different countries. In one of the analyses of this dataset (published in The Lancet in 2017; Dehghan et al. (2017)), the PURE investigators reported an analysis of how intake of various classes of macronutrients (including saturated fats and carbohydrates) was related to the likelihood of dying during the time that people were followed. People were followed for a median of 7.4 years, meaning that half of the people in the study were followed for less and half were followed for more than 7.4 years. Figure 1.1 plots some of the data from the study (extracted from the paper), showing the relationship between the intake of both saturated fats and carbohydrates and the risk of dying from any cause.
Figure 1.1: A plot of data from the PURE study, showing the relationship between death from any cause and the relative intake of saturated fats and carbohydrates.
This plot is based on ten numbers. To obtain these numbers, the researchers split the group of 135,335 study participants (which we call the "sample") into 5 groups ("quintiles") after ordering them in terms of their intake of either of the nutrients; the first quintile contains the 20% of people with the lowest intake, and the 5th quintile contains the 20% with the highest intake. The researchers then computed how often people in each of those groups died during the time they were being followed. The figure expresses this in terms of the relative risk of dying in comparison to the lowest quintile: If this number is greater than 1 it means that people in the group are more likely to die than are people in the lowest quintile, whereas if it's less than 1 it means that people in the group are less likely to die. The figure is pretty clear: People who ate more saturated fat were less likely to die during the study, with the lowest death rate seen for people who were in the fourth quintile (that is, who ate more fat than the lowest 60% but less than the top 20%). The opposite is seen for carbohydrates; the more carbs a person ate, the more likely they were to die during the study.
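To make the quintile computation described above concrete, here is a minimal sketch of that kind of analysis in R. The data are simulated and the variable names (sat_fat_pct, died) are invented for illustration; this is not the PURE study's actual data or analysis code.

# Simulated intake and follow-up data (illustration only)
set.seed(1)
n <- 1000
sat_fat_pct <- runif(n, min = 2, max = 20)   # percent of calories from saturated fat
died <- rbinom(n, size = 1, prob = 0.10)     # 1 = died during follow-up, 0 = survived

# Split people into quintiles of intake (five equally sized groups)
quintile <- cut(sat_fat_pct,
                breaks = quantile(sat_fat_pct, probs = seq(0, 1, by = 0.2)),
                include.lowest = TRUE, labels = 1:5)

# Death rate within each quintile, then the risk relative to the lowest quintile
death_rate <- tapply(died, quintile, mean)
relative_risk <- death_rate / death_rate[1]
round(relative_risk, 2)

With real data, a relative risk above 1 for a quintile would mean a higher death rate than in the lowest-intake group, and a value below 1 a lower death rate, which is exactly how Figure 1.1 should be read.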
This example shows how we can use statistics to describe a complex dataset in terms of a much simpler set of numbers; if we had to look at the data from each of the study participants at the same time, we would be overloaded with data and it would be hard to see the pattern that emerges when they are described more simply. The numbers in Figure 1.1 seem to show that deaths decrease with saturated fat and increase with carbohydrate intake, but we also know that there is a lot of uncertainty in the data; there are some people who died early even though they ate a low-carb diet, and, similarly, some people who ate a ton of carbs but lived to a ripe old age. Given this variability, we want to decide whether the relationships that we see in the data are large enough that we wouldn’t expect them to occur randomly if there was not truly a relationship between diet and longevity. Statistics provide us with the tools to make these kinds of decisions, and often people from the outside view this as the main purpose of statistics. But as we will see throughout the book, this need for black-and-white decisions based on fuzzy evidence has often led researchers astray. Based on the data we would also like to make predictions about future outcomes. For example, a life insurance company might want to use data about a particular person’s intake of fat and carbohydrate to predict how long they are likely to live. An important aspect of prediction is that it requires us to generalize from the data we already have to some other situation, often in the future; if our conclusions were limited to the specific people in the study at a particular time, then the study would not be very useful. In general, researchers must assume that their particular sample is representative of a larger population, which requires that they obtain the sample in a way that provides an unbiased picture of the population. For example, if the PURE study had recruited all of its participants from religious sects that practice vegetarianism, then we probably wouldn’t want to generalize the results to people who follow different dietary standards. This page titled 1.1.3: What Can Statistics Do for Us? is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.3: What Can Statistics Do for Us? by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.1.3.2 https://stats.libretexts.org/@go/page/35596 1.1.4: The Big Ideas of Statistics There are a number of very basic ideas that cut through nearly all aspects of statistical thinking. Several of these are outlined by Stigler (2016) in his outstanding book “The Seven Pillars of Statistical Wisdom”, which I have augmented here. 1.4.1 Learning from data One way to think of statistics is as a set of tools that enable us to learn from data. In any situation, we start with a set of ideas or hypotheses about what might be the case. In the PURE study, the researchers may have started out with the expectation that eating more fat would lead to higher death rates, given the prevailing negative dogma about saturated fats. Later in the course we will introduce the idea of prior knowledge, which is meant to reflect the knowledge that we bring to a situation. 
This prior knowledge can vary in its strength, often based on our amount of experience; if I visit a restaurant for the first time I am likely to have a weak expectation of how good it will be, but if I visit a restaurant where I have eaten ten times before, my expectations will be much stronger. Similarly, if I look at a restaurant review site and see that a restaurant’s average rating of four stars is only based on three reviews, I will have a weaker expectation than I would if it was based on 300 reviews. Statistics provides us with a way to describe how new data can be best used to update our beliefs, and in this way there are deep links between statistics and psychology. In fact, many theories of human and animal learning from psychology are closely aligned with ideas from the new field of machine learning. Machine learning is a field at the interface of statistics and computer science that focuses on how to build computer algorithms that can learn from experience. While statistics and machine learning often try to solve the same problems, researchers from these fields often take very different approaches; the famous statistician Leo Breiman once referred to them as “The Two Cultures” to reflect how different their approaches can be (Breiman 2001). In this book I will try to blend the two cultures together because both approaches provide useful tools for thinking about data. 1.4.2 Aggregation Another way to think of statistics is “the science of throwing away data”. In the example of the PURE study above, we took more than 100,000 numbers and condensed them into ten. It is this kind of aggregation that is one of the most important concepts in statistics. When it was first advanced, this was revolutionary: If we throw out all of the details about every one of the participants, then how can we be sure that we aren’t missing something important? As we will see, statistics provides us ways to characterize the structure of aggregates of data, and with theoretical foundations that explain why this usually works well. However, it’s also important to keep in mind that aggregation can go too far, and later we will encounter cases where a summary can provide a misleading picture of the data being summarized. 1.4.3 Uncertainty The world is an uncertain place. We now know that cigarette smoking causes lung cancer, but this causation is probabilistic: A 68- year-old man who smoked two packs a day for the past 50 years and continues to smoke has a 15% (1 out of 7) risk of getting lung cancer, which is much higher than the chance of lung cancer in a nonsmoker. However, it also means that there will be many people who smoke their entire lives and never get lung cancer. Statistics provides us with the tools to characterize uncertainty, to make decisions under uncertainty, and to make predictions whose uncertainty we can quantify. One often sees journalists write that scientific researchers have “proven” some hypothesis. But statistical analysis can never “prove” a hypothesis, in the sense of demonstrating that it must be true (as one would in a logical or mathematical proof). Statistics can provide us with evidence, but it’s always tentative and subject to the uncertainty that is always present in the real world. 1.4.4 Sampling The concept of aggregation implies that we can make useful insights by collapsing across data – but how much data do we need? 
The idea of sampling says that we can summarize an entire population based on just a small number of samples from the population, as long as those samples are obtained in the right way. For example, the PURE study enrolled a sample of about 135,000 people, but its goal was to provide insights about the billions of humans who make up the population from which those people were sampled. As we already discussed above, the way that the study sample is obtained is critical, as it determines how broadly we can generalize the results. Another fundamental insight about sampling is that while larger samples are always better (in terms of their ability to accurately represent the entire population), there are diminishing returns as the sample gets larger. In fact, the rate at which the benefit of larger samples decreases follows a simple mathematical rule, growing as the square root of the sample size, such that in order to double the quality of our data we need to quadruple the size of our sample.
This page titled 1.1.4: The Big Ideas of Statistics is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.4: The Big Ideas of Statistics by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.1.4.2 https://stats.libretexts.org/@go/page/35597
1.1.5: Causality and Statistics
The PURE study seemed to provide pretty strong evidence for a positive relationship between eating saturated fat and living longer, but this doesn't tell us what we really want to know: If we eat more saturated fat, will that cause us to live longer? This is because we don't know whether there is a direct causal relationship between eating saturated fat and living longer. The data are consistent with such a relationship, but they are equally consistent with some other factor causing both higher saturated fat and longer life. For example, it is likely that people who are richer eat more saturated fat and richer people tend to live longer, but their longer life is not necessarily due to fat intake — it could instead be due to better health care, reduced psychological stress, better food quality, or many other factors. The PURE study investigators tried to account for these factors, but we can't be certain that their efforts completely removed the effects of other variables. The fact that other factors may explain the relationship between saturated fat intake and death is an example of why introductory statistics classes often teach that "correlation does not imply causation", though the renowned data visualization expert Edward Tufte has added, "but it sure is a hint."
Although observational research (like the PURE study) cannot conclusively demonstrate causal relations, we generally think that causation can be demonstrated using studies that experimentally control and manipulate a specific factor. In medicine, such a study is referred to as a randomized controlled trial (RCT). Let's say that we wanted to do an RCT to examine whether increasing saturated fat intake increases life span. To do this, we would sample a group of people, and then assign them to either a treatment group (which would be told to increase their saturated fat intake) or a control group (who would be told to keep eating the same as before).
It is essential that we assign the individuals to these groups randomly. Otherwise, people who choose the treatment might be different in some way than people who choose the control group – for example, they might be more likely to engage in other healthy behaviors as well. We would then follow the participants over time and see how many people in each group died. Because we randomized the participants to treatment or control groups, we can be reasonably confident that there are no other differences between the groups that would confound the treatment effect; however, we still can't be certain because sometimes randomization yields treatment versus control groups that do vary in some important way. Researchers often try to address these confounds using statistical analyses, but removing the influence of a confound from the data can be very difficult.
A number of RCTs have examined the question of whether changing saturated fat intake results in better health and longer life. These trials have focused on reducing saturated fat because of the strong dogma amongst nutrition researchers that saturated fat is deadly; most of these researchers would have probably argued that it was not ethical to cause people to eat more saturated fat! However, the RCTs have shown a very consistent pattern: Overall there is no appreciable effect on death rates of reducing saturated fat intake.
This page titled 1.1.5: Causality and Statistics is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.5: Causality and Statistics by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.1.5.1 https://stats.libretexts.org/@go/page/35598
SECTION OVERVIEW
1.2: Working with Data
Learning Objectives
Having read this chapter, you should be able to:
Distinguish between different types of variables (quantitative/qualitative, binary/integer/real, discrete/continuous) and give examples of each of these kinds of variables
Distinguish between the concepts of reliability and validity and apply each concept to a particular dataset
1.2.1: What Are Data? 1.2.2: Data Basics 1.2.3: Scales of Measurement 1.2.4: What Makes a Good Measurement? 1.2.5: Overview of Data Collection Principles 1.2.6: Observational Studies and Sampling Strategies 1.2.7: Experiments 1.2.8: How Not to Do Statistics 1.2.9: Exercises
This page titled 1.2: Working with Data is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.2.1 https://stats.libretexts.org/@go/page/35599
1.2.1: What Are Data?
The first important point about data is that data are - meaning that the word "data" is plural (though some people disagree with me on this). You might also wonder how to pronounce "data" – I say "day-tah" but I know many people who say "dah-tah" and I have been able to remain friends with them in spite of this. Now if I heard them say "the data is" then that would be a bigger issue…
2.1.1 Qualitative data
Data are composed of variables, where a variable reflects a unique measurement or quantity. Some variables are qualitative, meaning that they describe a quality rather than a numeric quantity.
For example, in my stats course I generally give an introductory survey, both to obtain data to use in class and to learn more about the students. One of the questions that I ask is "What is your favorite food?", to which some of the answers have been: blueberries, chocolate, tamales, pasta, pizza, and mango. Those data are not intrinsically numerical; we could assign numbers to each one (1 = blueberries, 2 = chocolate, etc.), but we would just be using the numbers as labels rather than as real numbers; for example, it wouldn't make sense to add the numbers together in this case. However, we will often code qualitative data using numbers in order to make them easier to work with, as you will see later.
2.1.2 Quantitative data
More commonly in statistics we will work with quantitative data, meaning data that are numerical. For example, Table 2.1 shows the results from another question that I ask in my introductory class, which is "Why are you taking this class?"
Table 2.1: Counts of the prevalence of different responses to the question "Why are you taking this class?"
Why are you taking this class?                        Number of students
It fulfills a degree plan requirement                 105
It fulfills a General Education Breadth Requirement   32
It is not required but I am interested in the topic   11
Other                                                 4
Note that the students' answers were qualitative, but we generated a quantitative summary of them by counting how many students gave each response.
2.1.2.1 Types of numbers
There are several different types of numbers that we work with in statistics. It's important to understand these differences, in part because programming languages like R often distinguish between them.
Binary numbers. The simplest are binary numbers – that is, zero or one. We will often use binary numbers to represent whether something is true or false, or present or absent. For example, I might ask 10 people if they have ever experienced a migraine headache, recording their answers as "Yes" or "No". It's often useful to instead use logical values, which take the value of either TRUE or FALSE. We can create these by testing whether each value is equal to "Yes", which we can do using the == symbol. This will return the value TRUE for any matching "Yes" values, and FALSE otherwise. These are useful because R knows how to interpret them natively, whereas it doesn't know what "Yes" and "No" mean. In general, most programming languages treat truth values and binary numbers equivalently. The number 1 is equal to the logical value TRUE, and the number zero is equal to the logical value FALSE.
Integers. Integers are whole numbers with no fractional or decimal part. We most commonly encounter integers when we count things, but they also often occur in psychological measurement. For example, in my introductory survey I administer a set of questions about attitudes towards statistics (such as "Statistics seems very mysterious to me."), on which the students respond with a number between 1 ("Disagree strongly") and 7 ("Agree strongly").
Real numbers. Most commonly in statistics we work with real numbers, which have a fractional/decimal part. For example, we might measure someone's weight, which can be measured to an arbitrary level of precision, from whole pounds down to micrograms.
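Here is a minimal sketch of these value types in R; the answers and numbers are made up for this example rather than taken from the author's survey.

# Hypothetical yes/no answers to "Have you ever experienced a migraine headache?"
migraine <- c("Yes", "No", "No", "Yes", "No")

# Logical values created with the == test described above
had_migraine <- migraine == "Yes"   # TRUE for every "Yes", FALSE otherwise
sum(had_migraine)                   # TRUE counts as 1 and FALSE as 0, so this counts the "Yes" answers

# An integer, such as a 1-7 attitude rating (the L suffix marks an integer in R)
rating <- 5L

# A real number with a decimal part, such as a weight
weight <- 154.3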
1.2.1.1 https://stats.libretexts.org/@go/page/35600 This page titled 1.2.1: What Are Data? is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 2.1: What Are Data? by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site. 1.2.1.2 https://stats.libretexts.org/@go/page/35600
1.2.2: Data Basics
Effective presentation and description of data is a first step in most analyses. This section introduces one structure for organizing data as well as some terminology that will be used throughout this book.
Observations, variables, and data matrices
Table 1.3 displays rows 1, 2, 3, and 50 of a data set concerning 50 emails received during early 2012. These observations will be referred to as the email50 data set, and they are a random sample from a larger data set that we will see in Section 1.7.
Table 1.3: Four rows from the email50 data matrix.
      spam   num_char   line_breaks   format   number
1     no     21,705     551           html     small
2     no     7,011      183           html     big
3     yes    631        28            text     none
...
50    no     15,829     242           html     small
Each row in the table represents a single email or case (a case is also sometimes called a unit of observation or an observational unit). The columns represent characteristics, called variables, for each of the emails. For example, the first row represents email 1, which is not spam, contains 21,705 characters, 551 line breaks, is written in HTML format, and contains only small numbers.
In practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood. For instance, it is always important to be sure we know what each variable means and the units of measurement. Descriptions of all five email variables are given in Table 1.4.
Table 1.4: Variables and their descriptions for the email50 data set.
variable      description
spam          Specifies whether the message was spam
num_char      The number of characters in the email
line_breaks   The number of line breaks in the email (not including text wrapping)
format        Indicates if the email contained special formatting, such as bolding, tables, or links, which would indicate the message is in HTML format
number        Indicates whether the email contained no number, a small number (under 1 million), or a large number
The data in Table 1.3 represent a data matrix, which is a common way to organize data. Each row of a data matrix corresponds to a unique case, and each column corresponds to a variable. A data matrix for the stroke study introduced in Section 1.1 is shown in Table 1.1, where the cases were patients and there were three variables recorded for each patient. Data matrices are a convenient way to record and store data. If another individual or case is added to the data set, an additional row can be easily added. Similarly, another column can be added for a new variable.
Exercise 1.2.2.1
Exercise 1.2 We consider a publicly available data set that summarizes information about the 3,143 counties in the United States, and we call this the county data set. This data set includes information about each county: its name, the state where it resides, its population in 2000 and 2010, per capita federal spending, poverty rate, and five additional characteristics. How might these data be organized in a data matrix? Reminder: look in the footnotes for answers to in-text exercises.5
5 Each county may be viewed as a case, and there are eleven pieces of information recorded for each case. A table with 3,143 rows and 11 columns could hold these data, where each row represents a county and each column represents a particular piece of information. 1.2.2.1 https://stats.libretexts.org/@go/page/35601
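The following sketch shows one way a small data matrix like Table 1.3 might be represented in R as a data frame, and how a new case (row) or variable (column) can be added; the object names emails and new_email are chosen for illustration and are not part of the textbook's own code.

# Three cases (emails) and five variables, using the values shown in Table 1.3
emails <- data.frame(
  spam        = c("no", "no", "yes"),
  num_char    = c(21705, 7011, 631),
  line_breaks = c(551, 183, 28),
  format      = c("html", "html", "text"),
  number      = c("small", "big", "none")
)

# Adding another case is just adding a row
new_email <- data.frame(spam = "no", num_char = 15829, line_breaks = 242,
                        format = "html", number = "small")
emails <- rbind(emails, new_email)

# Adding a new variable is just adding a column (all of these emails are from early 2012)
emails$year <- 2012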
Seven rows of the county data set are shown in Table 1.5, and the variables are summarized in Table 1.6. These data were collected from the US Census website.6
6 quickfacts.census.gov/qfd/index.html
Table 1.5: Seven rows from the county data set.
       name       state   pop2000   pop2010   fed_spend   poverty   homeownership   multiunit   income   med_income   smoking_ban
1      Autauga    AL      43671     54571     6.068       10.6      77.5            7.2         24568    53255        none
2      Baldwin    AL      140415    182265    6.140       12.2      76.7            22.6        26469    50147        none
3      Barbour    AL      29038     27457     8.752       25.0      68.0            11.1        15875    33219        none
4      Bibb       AL      20826     22915     7.122       12.6      82.9            6.6         19918    41770        none
5      Blount     AL      51024     57322     5.131       13.4      82.0            3.7         21070    45549        none
...
3142   Washakie   WY      8289      8533      8.714       5.6       70.9            10.0        28557    48379        none
3143   Weston     WY      6644      7208      6.695       7.9       77.9            6.5         28463    53853        none
Table 1.6: Variables and their descriptions for the county data set.
variable        description
name            County name
state           State where the county resides (also including the District of Columbia)
pop2000         Population in 2000
pop2010         Population in 2010
fed_spend       Federal spending per capita
poverty         Percent of the population in poverty
homeownership   Percent of the population that lives in their own home or lives with the owner (e.g. children living with parents who own the home)
multiunit       Percent of living units that are in multi-unit structures (e.g. apartments)
income          Income per capita
med_income      Median household income for the county, where a household's income equals the total income of its occupants who are 15 years or older
smoking_ban     Type of county-wide smoking ban in place at the end of 2011, which takes one of three values: none, partial, or comprehensive, where a comprehensive ban means smoking was not permitted in restaurants, bars, or workplaces, and partial means smoking was banned in at least one of those three locations
Types of variables
Examine the fed_spend, pop2010, state, and smoking_ban variables in the county data set. Each of these variables is inherently different from the other three yet many of them share certain characteristics. 1.2.2.2 https://stats.libretexts.org/@go/page/35601
First consider fed_spend, which is said to be a numerical variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. On the other hand, we would not classify a variable reporting telephone area codes as numerical since their average, sum, and difference have no clear meaning. The pop2010 variable is also numerical, although it seems to be a little different than fed_spend. This population count variable can only take whole non-negative numbers (0, 1, 2, ...). For this reason, the population variable is said to be discrete since it can only take numerical values with jumps. On the other hand, the federal spending variable is said to be continuous. The variable state can take up to 51 values after accounting for Washington, DC: AL, ..., and WY. Because the responses themselves are categories, state is called a categorical variable,7 and the possible values are called the variable's levels. Finally, consider the smoking_ban variable, which describes the type of county-wide smoking ban and takes values none, partial, or comprehensive in each county.
This variable seems to be a hybrid: it is a categorical variable but the levels have a natural ordering. A variable with these properties is called an ordinal variable. To simplify analyses, any ordinal variables in this book will be treated as categorical variables.
Example 1.3 Data were collected about students in a statistics course. Three variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course. Classify each of the variables as continuous numerical, discrete numerical, or categorical.
The number of siblings and student height represent numerical variables. Because the number of siblings is a count, it is discrete. Height varies continuously, so it is a continuous numerical variable. The last variable classifies students into two categories - those who have and those who have not taken a statistics course - which makes this variable categorical.
Exercise 1.2.2.1
Exercise 1.4 Consider the variables group and outcome (at 30 days) from the stent study in Section 1.1. Are these numerical or categorical variables?8
8 There are only two possible values for each variable, and in both cases they describe categories. Thus, each is a categorical variable.
7 Sometimes also called a nominal variable.
Relationships between variables
Many analyses are motivated by a researcher looking for a relationship between two or more variables. A social scientist may like to answer some of the following questions:
1. Is federal spending, on average, higher or lower in counties with high rates of poverty?
2. If homeownership is lower than the national average in one county, will the percent of multi-unit structures in that county likely be above or below the national average?
3. Which counties have a higher average income: those that enact one or more smoking bans or those that do not?
To answer these questions, data must be collected, such as the county data set shown in Table 1.5. Examining summary statistics could provide insights for each of the three questions about counties. Additionally, graphs can be used to visually summarize data and are useful for answering such questions as well. 1.2.2.3 https://stats.libretexts.org/@go/page/35601
Figure 1.8: A scatterplot showing fed_spend against poverty. Owsley County of Kentucky, with a poverty rate of 41.5% and federal spending of $21.50 per capita, is highlighted.
Scatterplots are one type of graph used to study the relationship between two numerical variables. Figure 1.8 compares the variables fed_spend and poverty. Each point on the plot represents a single county. For instance, the highlighted dot corresponds to County 1088 in the county data set: Owsley County, Kentucky, which had a poverty rate of 41.5% and federal spending of $21.50 per capita. The scatterplot suggests a relationship between the two variables: counties with a high poverty rate also tend to have slightly more federal spending. We might brainstorm as to why this relationship exists and investigate each idea to determine which is the most reasonable explanation.
Exercise 1.2.2.1
Exercise 1.5 Examine the variables in the email50 data set, which are described in Table 1.4. Create two questions about the relationships between these variables that are of interest to you.9
9 Two sample questions: (1) Intuition suggests that if there are many line breaks in an email then there would tend to also be many characters: does this hold true? (2) Is there a connection between whether an email format is plain text (versus HTML) and whether it is a spam message?
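Here is a minimal sketch of how a scatterplot in the spirit of Figure 1.8 can be drawn with base R graphics. Because the county data set itself is not reproduced in this section, the sketch simulates stand-in values; with the real data you would use the data frame's actual poverty and fed_spend columns instead.

# Simulated stand-in for the county data (illustration only)
set.seed(42)
county <- data.frame(
  poverty   = runif(3143, min = 2, max = 45),   # poverty rate, in percent
  fed_spend = rexp(3143, rate = 1 / 8)          # federal spending per capita
)

# One point per county, as in Figure 1.8
plot(county$poverty, county$fed_spend,
     xlab = "Poverty rate (%)",
     ylab = "Federal spending per capita",
     main = "Federal spending vs. poverty, by county")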
The fed_spend and poverty variables are said to be associated because the plot shows a discernible pattern. When two variables show some connection with one another, they are called associated variables. Associated variables can also be called dependent variables and vice-versa.
Example 1.2.2.1
Example 1.6 This example examines the relationship between homeownership and the percent of units in multi-unit structures (e.g. apartments, condos), which is visualized using a scatterplot in Figure 1.9. Are these variables associated? 1.2.2.4 https://stats.libretexts.org/@go/page/35601
Figure 1.9: A scatterplot of homeownership versus the percent of units that are in multi-unit structures for all 3,143 counties.
Solution It appears that the larger the fraction of units in multi-unit structures, the lower the homeownership rate. Since there is some relationship between the variables, they are associated.
Because there is a downward trend in Figure 1.9 - counties with more units in multi-unit structures are associated with lower homeownership - these variables are said to be negatively associated. A positive association is shown in the relationship between the poverty and fed_spend variables represented in Figure 1.8, where counties with higher poverty rates tend to receive more federal spending per capita.
If two variables are not associated, then they are said to be independent. That is, two variables are independent if there is no evident relationship between the two.
Associated or independent, never both
A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.
This page titled 1.2.2: Data Basics is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.3: Data Basics by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel is licensed CC BY-SA 3.0. Original source: https://www.openintro.org/book/os. 1.2.2.5 https://stats.libretexts.org/@go/page/35601
1.2.3: Scales of Measurement
2.4.1 Scales of measurement
All variables must take on at least two different possible values (otherwise they would be a constant rather than a variable), but different values of the variable can relate to each other in different ways, which we refer to as scales of measurement. There are four ways in which the different values of a variable can differ.
Identity: Each value of the variable has a unique meaning.
Magnitude: The values of the variable reflect different magnitudes and have an ordered relationship to one another — that is, some values are larger and some are smaller.
Equal intervals: Units along the scale of measurement are equal to one another. This means, for example, that the difference between 1 and 2 would be equal in its magnitude to the difference between 19 and 20.
Absolute zero: The scale has a true meaningful zero point. For example, for many measurements of physical quantities such as height or weight, this is the complete absence of the thing being measured.
There are four different scales of measurement that go along with these different ways that values of a variable can differ.
Nominal scale.
A nominal variable satisfies the criterion of identity, such that each value of the variable represents something different, but the numbers simply serve as qualitative labels as discussed above. For example, we might ask people for their political party affiliation, and then code those as numbers: 1 = "Republican", 2 = "Democrat", 3 = "Libertarian", and so on. However, the different numbers do not have any ordered relationship with one another.
Ordinal scale. An ordinal variable satisfies the criteria of identity and magnitude, such that the values can be ordered in terms of their magnitude. For example, we might ask a person with chronic pain to complete a form every day assessing how bad their pain is, using a 1-7 numeric scale. Note that while the person is presumably feeling more pain on a day when they report a 6 versus a day when they report a 3, it wouldn't make sense to say that their pain is twice as bad on the former versus the latter day; the ordering gives us information about relative magnitude, but the differences between values are not necessarily equal in magnitude.
Interval scale. An interval scale has all of the features of an ordinal scale, but in addition the intervals between units on the measurement scale can be treated as equal. A standard example is physical temperature measured in Celsius or Fahrenheit; the physical difference between 10 and 20 degrees is the same as the physical difference between 90 and 100 degrees, but each scale can also take on negative values.
Ratio scale. A ratio scale variable has all four of the features outlined above: identity, magnitude, equal intervals, and absolute zero. The difference between a ratio scale variable and an interval scale variable is that the ratio scale variable has a true zero point. Examples of ratio scale variables include physical height and weight, along with temperature measured in Kelvin.
There are two important reasons that we must pay attention to the scale of measurement of a variable. First, the scale determines what kind of mathematical operations we can apply to the data (see Table 2.2). A nominal variable can only be compared for equality; that is, do two observations on that variable have the same numeric value? It would not make sense to apply other mathematical operations to a nominal variable, since they don't really function as numbers in a nominal variable, but rather as labels. With ordinal variables, we can also test whether one value is greater or lesser than another, but we can't do any arithmetic. Interval and ratio variables allow us to perform arithmetic; with interval variables we can only add or subtract values, whereas with ratio variables we can also multiply and divide values.
Table 2.2: Different scales of measurement admit different types of numeric operations
           Equal/not equal   >/<   +/-   Multiply/divide
Nominal    OK
Ordinal    OK                OK
Interval   OK                OK    OK
Ratio      OK                OK    OK    OK
These constraints also imply that there are certain kinds of statistics that we can compute on each type of variable. Statistics that simply involve counting of different values (such as the most common value, known as the mode) can be calculated on any of the variable types. Other statistics are based on ordering or ranking of values (such as the median, which is the middle value when all of the values are ordered by their magnitude), and these require that the value at least be on an ordinal scale. Finally, statistics that involve adding up values (such as the average, or mean) require that the variables be at least on an interval scale. Having said that, we should note that it's quite common for researchers to compute the mean of variables that are only ordinal (such as responses on personality tests), but this can sometimes be problematic.
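These distinctions map fairly directly onto how variables are usually stored in R: nominal variables as factors, ordinal variables as ordered factors, and interval or ratio variables as numeric vectors. The following sketch uses invented values purely for illustration.

# Nominal: the labels have no ordering
party <- factor(c("Republican", "Democrat", "Libertarian", "Democrat"))

# Ordinal: ordered categories, so comparisons such as > are meaningful
pain <- factor(c(3, 6, 2, 7), levels = 1:7, ordered = TRUE)
pain[2] > pain[1]                     # TRUE: a 6 ranks above a 3, but is not "twice as much pain"

# Interval and ratio: plain numeric values, where arithmetic makes sense
temp_celsius <- c(10, 20, 90, 100)    # interval scale (can be negative; no true zero)
weight_kg <- c(60.2, 75.9, 82.4)      # ratio scale (true zero point)
diff(temp_celsius)                    # differences are meaningful on an interval scale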
This page titled 1.2.3: Scales of Measurement is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 2.4: Appendix by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site.

1.2.4: What Makes a Good Measurement?
In many fields such as psychology, the thing that we are measuring is not a physical feature, but instead is an unobservable theoretical concept, which we usually refer to as a construct. For example, let's say that I want to test how well you understand the distinction between the four different scales of measurement described above. I could give you a pop quiz that would ask you several questions about these concepts and count how many you got right. This test might or might not be a good measurement of the construct of your actual knowledge — for example, if I were to write the test in a confusing way or use language that you don't understand, then the test might suggest you don't understand the concepts when really you do. On the other hand, if I give a multiple choice test with very obvious wrong answers, then you might be able to perform well on the test even if you don't actually understand the material.

It is usually impossible to measure a construct without some amount of error. In the example above, you might know the answer but you might mis-read the question and get it wrong. In other cases there is error intrinsic to the thing being measured, such as when we measure how long it takes a person to respond on a simple reaction time test, which will vary from trial to trial for many reasons. We generally want our measurement error to be as low as possible.

Sometimes there is a standard against which other measurements can be tested, which we might refer to as a "gold standard" — for example, measurement of sleep can be done using many different devices (such as devices that measure movement in bed), but they are generally considered inferior to the gold standard of polysomnography (which uses measurement of brain waves to quantify the amount of time a person spends in each stage of sleep). Often the gold standard is more difficult or expensive to perform, and the cheaper method is used even though it might have greater error.

When we think about what makes a good measurement, we usually distinguish two different aspects of a good measurement.

2.5.1 Reliability
Reliability refers to the consistency of our measurements. One common form of reliability, known as "test-retest reliability", measures how well the measurements agree if the same measurement is performed twice. For example, I might give you a questionnaire about your attitude towards statistics today, repeat this same questionnaire tomorrow, and compare your answers on the two days; we would hope that they would be very similar to one another, unless something happened in between the two tests that should have changed your view of statistics (like reading this book!).
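To make test-retest reliability concrete, here is a minimal R sketch. The numbers are simulated and the object names (true_attitude, day1, day2) are invented for this illustration; in practice you would correlate two real administrations of the same questionnaire.

set.seed(1)
true_attitude <- rnorm(30, mean = 50, sd = 10)  # each person's underlying attitude toward statistics
day1 <- true_attitude + rnorm(30, sd = 3)       # first administration, with measurement error
day2 <- true_attitude + rnorm(30, sd = 3)       # second administration, with measurement error
cor(day1, day2)  # the test-retest correlation; values near 1 indicate high reliability

If the measurement error were much larger relative to the true differences between people, this correlation would shrink toward zero, which is one way to see why noisy measures are unreliable.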
Another way to assess reliability comes in cases where the data include subjective judgments. For example, let's say that a researcher wants to determine whether a treatment changes how well an autistic child interacts with other children, which is measured by having experts watch the child and rate their interactions with the other children. In this case we would like to make sure that the answers don't depend on the individual rater — that is, we would like for there to be high inter-rater reliability. This can be assessed by having more than one rater perform the rating, and then comparing their ratings to make sure that they agree well with one another.

Reliability is important if we want to compare one measurement to another. The relationship between two different variables can't be any stronger than the relationship between either of the variables and itself (i.e., its reliability). This means that an unreliable measure can never have a strong statistical relationship with any other measure. For this reason, researchers developing a new measurement (such as a new survey) will often go to great lengths to establish and improve its reliability.

Figure 2.1: A figure demonstrating the distinction between reliability and validity, using shots at a bullseye. Reliability refers to the consistency of location of shots, and validity refers to the accuracy of the shots with respect to the center of the bullseye.

2.5.2 Validity
Reliability is important, but on its own it's not enough: After all, I could create a perfectly reliable measurement on a personality test by re-coding every answer using the same number, regardless of how the person actually answers. We want our measurements to also be valid — that is, we want to make sure that we are actually measuring the construct that we think we are measuring (Figure 2.1). There are many different types of validity that are commonly discussed; we will focus on three of them.

Face validity. Does the measurement make sense on its face? If I were to tell you that I was going to measure a person's blood pressure by looking at the color of their tongue, you would probably think that this was not a valid measure on its face. On the other hand, using a blood pressure cuff would have face validity. This is usually a first reality check before we dive into more complicated aspects of validity.

Construct validity. Is the measurement related to other measurements in an appropriate way? This is often subdivided into two aspects. Convergent validity means that the measurement should be closely related to other measures that are thought to reflect the same construct. Let's say that I am interested in measuring how extroverted a person is using a questionnaire or an interview. Convergent validity would be demonstrated if both of these different measurements are closely related to one another. On the other hand, measurements thought to reflect different constructs should be unrelated, known as divergent validity. If my theory of personality says that extraversion and conscientiousness are two distinct constructs, then I should also see that my measurements of extraversion are unrelated to measurements of conscientiousness.

Predictive validity. If our measurements are truly valid, then they should also be predictive of other outcomes. For example, let's say that we think that the psychological trait of sensation seeking (the desire for new experiences) is related to risk taking in the real world.
To test for predictive validity of a measurement of sensation seeking, we would test how well scores on the test predict scores on a different survey that measures real-world risk taking.

This page titled 1.2.4: What Makes a Good Measurement? is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 2.5: What Makes a Good Measurement? by Russell A. Poldrack is licensed CC BY-NC 2.0. Original source: https://statsthinking21.github.io/statsthinking21-core-site.

1.2.5: Overview of Data Collection Principles
The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals.

Populations and samples
Consider the following three research questions:
1. What is the average mercury content in swordfish in the Atlantic Ocean?
2. Over the last 5 years, what is the average time to degree for Duke undergraduate students?
3. Does a new drug reduce the number of deaths in patients with severe heart disease?

Each research question refers to a target population. In the first question, the target population is all swordfish in the Atlantic Ocean, and each fish represents a case. Often it is too expensive to collect data for every case in a population. Instead, a sample is taken. A sample represents a subset of the cases and is often a small fraction of the population. For instance, 60 swordfish (or some other number) in the population might be selected, and this sample data may be used to provide an estimate of the population average and answer the research question.

Exercise 1.7
For the second and third questions above, identify the target population and what represents an individual case.10

Anecdotal Evidence
Consider the following possible responses to the three research questions:
1. A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.
2. I met two students who took more than 7 years to graduate from Duke, so it must take longer to graduate at Duke than at many other colleges.
3. My friend's dad had a heart attack and died after they gave him a new heart disease drug, so the drug must not work.

Each of the conclusions is based on some data. However, there are two problems. First, the data only represent one or two cases. Second, and more importantly, it is unclear whether these cases are actually representative of the population. Data collected in this haphazard fashion are called anecdotal evidence.

10 (2) Notice that this question is only relevant to students who complete their degree; the average cannot be computed using a student who never finished her degree. Thus, only Duke undergraduate students who have graduated in the last five years represent cases in the population under consideration. Each such student would represent an individual case. (3) A person with severe heart disease represents a case. The population includes all people with severe heart disease.
Figure 1.10: In February 2010, some media pundits cited one large snow storm as valid evidence against global warming. As comedian Jon Stewart pointed out, "It's one storm, in one region, of one country."

Anecdotal evidence
Be careful of data collected in a haphazard fashion. Such evidence may be true and verifiable, but it may only represent extraordinary cases.

Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics. For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years. Instead of looking at the most unusual cases, we should examine a sample of many cases that represent the population.

Sampling from a Population
We might try to estimate the time to graduation for Duke undergraduates in the last 5 years by collecting a sample of students. All graduates in the last 5 years represent the population, and graduates who are selected for review are collectively called the sample. In general, we always seek to randomly select a sample from a population. The most basic type of random selection is equivalent to how raffles are conducted. For example, in selecting graduates, we could write each graduate's name on a raffle ticket and draw 100 tickets. The selected names would represent a random sample of 100 graduates.

Why pick a sample randomly? Why not just pick a sample by hand? Consider the following scenario.

Example 1.8
Suppose we ask a student who happens to be majoring in nutrition to select several graduates for the study. What kind of students do you think she might collect? Do you think her sample would be representative of all graduates?
Perhaps she would pick a disproportionate number of graduates from health-related fields. Or perhaps her selection would be well-representative of the population. When selecting samples by hand, we run the risk of picking a biased sample, even if that bias is unintentional or difficult to discern.

Figure 1.11: In this graphic, five graduates are randomly selected from the population to be included in the sample.
Figure 1.12: Instead of sampling from all graduates equally, a nutrition major might inadvertently pick graduates with health-related majors disproportionally often.

If someone was permitted to pick and choose exactly which graduates were included in the sample, it is entirely possible that the sample could be skewed to that person's interests, which may be entirely unintentional. This introduces bias into a sample. Sampling randomly helps resolve this problem. The most basic random sample is called a simple random sample, and it is the equivalent of using a raffle to select cases. This means that each case in the population has an equal chance of being included and there is no implied connection between the cases in the sample.

The act of taking a simple random sample helps minimize bias; however, bias can crop up in other ways. Even when people are picked at random, e.g. for surveys, caution must be exercised if the non-response is high. For instance, if only 30% of the people randomly sampled for a survey actually respond, then it is unclear whether the results are representative of the entire population. This non-response bias can skew results. Another common downfall is a convenience sample, where individuals who are easily accessible are more likely to be included in the sample. For instance, if a political survey is done by stopping people walking in the Bronx, this will not represent all of New York City. It is often difficult to discern what sub-population a convenience sample represents.
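In R, the raffle-style selection described above is usually done with the sample() function. The sketch below is illustrative only; the vector of graduate names is made up, and the name duke_grads in the comment is a hypothetical data frame rather than anything supplied with this book.

set.seed(2023)
graduates <- paste0("grad_", 1:12000)        # a stand-in for the list of all graduates (the population)
my_sample <- sample(graduates, size = 100)   # a simple random sample of 100 graduates
head(my_sample)

# With a data frame, one common pattern is to sample the row indices instead, e.g.
# rows <- sample(nrow(duke_grads), size = 100)
# duke_grads[rows, ]

Because every name has the same chance of being drawn, this avoids the kind of unintentional bias that hand-picking a sample can introduce, though it does nothing to fix non-response.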
Exercise 1.9
We can easily access ratings for products, sellers, and companies through websites. These ratings are based only on those people who go out of their way to provide a rating. If 50% of online reviews for a product are negative, do you think this means that 50% of buyers are dissatisfied with the product?11

11 Answers will vary. From our own anecdotal experiences, we believe people tend to rant more about products that fell below expectations than rave about those that perform as expected. For this reason, we suspect there is a negative bias in product ratings on sites like Amazon. However, since our experiences may not be representative, we also keep an open mind should data on the subject become available.

Figure 1.13: Due to the possibility of non-response, survey studies may only reach a certain group within the population. It is difficult, and oftentimes impossible, to completely fix this problem.

Explanatory and Response Variables
Consider the following question from page 7 for the county data set:
(1) Is federal spending, on average, higher or lower in counties with high rates of poverty?
If we suspect poverty might affect spending in a county, then poverty is the explanatory variable and federal spending is the response variable in the relationship.12 If there are many variables, it may be possible to consider a number of them as explanatory variables.

TIP: Explanatory and response variables
To identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other and plan an appropriate analysis.

explanatory variable  --(might affect)-->  response variable     (1.2.5.1)

Caution: association does not imply causation
Labeling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified between the two variables. We use these labels only to keep track of which variable we suspect affects the other.

In some cases, there is no explanatory or response variable. Consider the following question from page 7:
(2) If homeownership is lower than the national average in one county, will the percent of multi-unit structures in that county likely be above or below the national average?
It is difficult to decide which of these variables should be considered the explanatory and response variable, i.e. the direction is ambiguous, so no explanatory or response labels are suggested here.

12 Sometimes the explanatory variable is called the independent variable and the response variable is called the dependent variable. However, this becomes confusing since a pair of variables might be independent or dependent, so we avoid this language.

Introducing observational studies and experiments
There are two primary types of data collection: observational studies and experiments. Researchers perform an observational study when they collect data in a way that does not directly interfere with how the data arise. For instance, researchers may collect information via surveys, review medical or company records, or follow a cohort of many similar individuals to study why certain diseases might develop.
In each of these situations, researchers merely observe the data that arise. In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection.

When researchers want to investigate the possibility of a causal connection, they conduct an experiment. Usually there will be both an explanatory and a response variable. For instance, we may suspect administering a drug will reduce mortality in heart attack patients over the following year. To check if there really is a causal connection between the explanatory variable and the response, researchers will collect a sample of individuals and split them into groups. The individuals in each group are assigned a treatment. When individuals are randomly assigned to a group, the experiment is called a randomized experiment. For example, each heart attack patient in the drug trial could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a placebo (fake treatment) and the second group receives the drug. See the case study in Section 1.1 for another example of an experiment, though that study did not employ a placebo.

TIP: association ≠ causation
In general, association does not imply causation, and causation can only be inferred from a randomized experiment.

This page titled 1.2.5: Overview of Data Collection Principles is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.4: Overview of Data Collection Principles by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel is licensed CC BY-SA 3.0. Original source: https://www.openintro.org/book/os.

1.2.6: Observational Studies and Sampling Strategies
Observational Studies
Generally, data in observational studies are collected only by monitoring what occurs, while experiments require the primary explanatory variable in a study be assigned for each subject by the researchers. Making causal conclusions based on experiments is often reasonable. However, making the same causal conclusions based on observational data can be treacherous and is not recommended. Thus, observational studies are generally only sufficient to show associations.

Exercise 1.2.6.1
Suppose an observational study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?
Solution
No. See the paragraph following the exercise for an explanation.

Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, she is more likely to use sunscreen and more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple investigation. Sun exposure is what is called a confounding variable (also called a lurking variable, confounding factor, or a confounder), which is a variable that is correlated with both the explanatory and response variables.
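A small simulation can make the sunscreen example concrete. The R code below is a toy model with made-up numbers, not real data: sun exposure drives both sunscreen use and skin cancer risk, yet sunscreen and skin cancer end up positively associated even though sunscreen has no causal effect in the simulation.

set.seed(10)
n <- 5000
sun_exposure <- rnorm(n)                 # the confounding variable
sunscreen    <- sun_exposure + rnorm(n)  # people who are in the sun more use more sunscreen
# skin cancer risk depends only on sun exposure, not on sunscreen, in this toy model
skin_cancer  <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * sun_exposure))
cor(sunscreen, skin_cancer)  # positive, despite sunscreen having no causal effect here

This is exactly the trap described above: an observational association between sunscreen and skin cancer appears because both are tied to the unmeasured confounder.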
While one method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, there is no guarantee that all confounding variables can be examined or measured. In the same way, the county data set is an observational study with confounding variables, and its data cannot easily be used to make causal conclusions.

Exercise 1.2.6.2
Figure 1.9 shows a negative association between the homeownership rate and the percentage of multi-unit structures in a county. However, it is unreasonable to conclude that there is a causal relationship between the two variables. Suggest one or more other variables that might explain the relationship visible in Figure 1.9.
Solution
Answers will vary. Population density may be important. If a county is very dense, then this may require a larger fraction of residents to live in multi-unit structures. Additionally, the high density may contribute to increases in property value, making homeownership infeasible for many residents.

Observational studies come in two forms: prospective and retrospective studies. A prospective study identifies individuals and collects information as events unfold. For instance, medical researchers may identify and follow a group of similar individuals over many years to assess the possible influences of behavior on cancer risk. One example of such a study is The Nurses' Health Study, started in 1976 and expanded in 1989. This prospective study recruits registered nurses and then collects data from them using questionnaires. Retrospective studies collect data after events have taken place, e.g. researchers may review past events in medical records. Some data sets, such as county, may contain both prospectively- and retrospectively-collected variables. Local governments prospectively collected some variables as events unfolded (e.g. retail sales) while the federal government retrospectively collected others during the 2010 census (e.g. county population counts).

Three Sampling Methods
Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, these statistical methods are not reliable. Here we consider three random sampling techniques: simple, stratified, and cluster sampling. Figure 1.14 provides a graphical representation of these techniques.

Simple random sampling is probably the most intuitive form of random sampling. Consider the salaries of Major League Baseball (MLB) players, where each player is a member of one of the league's 30 teams. To take a simple random sample of 120 baseball players and their salaries from the 2010 season, we could write the names of that season's 828 players onto slips of paper, drop the slips into a bucket, shake the bucket around until we are sure the names are all mixed up, then draw out slips until we have the sample of 120 players. In general, a sample is referred to as "simple random" if each case in the population has an equal chance of being included in the final sample and knowing that a case is included in a sample does not provide useful information about which other cases are included.

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum. In the baseball salary example, the teams could represent the strata; some teams have a lot more money (we're looking at you, Yankees). Then we might randomly sample 4 players from each team for a total of 120 players.
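One way such a stratified sample could be drawn in R is sketched below. The mlb data frame here is a toy stand-in built on the spot (the player and team labels are invented), so the code illustrates the mechanics rather than reproducing the real 2010 salary data.

set.seed(5)
# a toy roster: 828 players spread across 30 teams
mlb <- data.frame(player = paste0("player_", 1:828),
                  team   = rep(paste0("team_", 1:30), length.out = 828))
# split the row indices by team (the strata), then take a simple random sample within each stratum
rows_by_team <- split(seq_len(nrow(mlb)), mlb$team)
sampled_rows <- unlist(lapply(rows_by_team, sample, size = 4))
strat_sample <- mlb[sampled_rows, ]
table(strat_sample$team)  # 4 players from each of the 30 teams, 120 players in total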
Figure 1.14: Examples of simple random, stratified, and cluster sampling. In the top panel, simple random sampling was used to randomly select the 18 cases. In the middle panel, stratified sampling was used: cases were grouped into strata, and then simple random sampling was employed within each stratum. In the bottom panel, cluster sampling was used, where data were binned into nine clusters, three of the clusters were randomly selected, and six cases were randomly sampled in each of these clusters.

Stratified sampling is especially useful when the cases in each stratum are very similar with respect to the outcome of interest. The downside is that analyzing data from a stratified sample is a more complex task than analyzing data from a simple random sample. The analysis methods introduced in this book would need to be extended to analyze data collected using stratified sampling.

Example 1.2.6.1
Why would it be good for cases within each stratum to be very similar?
Solution
We might get a more stable estimate for the subpopulation in a stratum if the cases are very similar. These improved estimates for each subpopulation will help us build a reliable estimate for the full population.

A cluster sample is much like a two-stage simple random sample. We break up the population into many groups, called clusters. Then we sample a fixed number of clusters and collect a simple random sample within each cluster. This technique is similar to stratified sampling in its process, except that there is no requirement in cluster sampling to sample from every cluster. Stratified sampling requires observations be sampled from every stratum.

Figure 1.15: Examples of cluster and multistage sampling. In the top panel, cluster sampling was used. Here, data were binned into nine clusters, three of these clusters were sampled, and all observations within these three clusters were included in the sample. In the bottom panel, multistage sampling was used. It differs from cluster sampling in that of the clusters selected, we randomly select a subset of each cluster to be included in the sample.

Sometimes cluster sampling can be a more economical random sampling technique than the alternatives. Also, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another. For example, if neighborhoods represented clusters, then this sampling method works best when the neighborhoods are very diverse. A downside of cluster sampling is that more advanced analysis techniques are typically required, though the methods in this book can be extended to handle such data.

Example 1.2.6.3
Suppose we are interested in estimating the malaria rate in a densely tropical portion of rural Indonesia. We learn that there are 30 villages in that part of the Indonesian jungle, each more or less similar to the next. Our goal is to test 150 individuals for malaria. What sampling method should be employed?
Solution
A simple random sample would likely draw individuals from all 30 villages, which could make data collection extremely expensive. Stratified sampling would be a challenge since it is unclear how we would build strata of similar individuals. However, cluster sampling seems like a very good idea. First, we might randomly select half the villages, then randomly select 10 people from each. This would probably reduce our data collection costs substantially in comparison to a simple random sample and would still give us reliable information.
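A rough R sketch of how that two-stage selection might look follows. Everything here is hypothetical: the village names and the 200-person rosters are invented so the example can run on its own, and the split into 15 villages of 10 people each is just one way to reach 150 individuals.

set.seed(30)
villages <- paste0("village_", 1:30)
chosen_villages <- sample(villages, size = 15)  # stage 1: randomly select half the villages
# stage 2: randomly select 10 residents within each chosen village (toy rosters of 200 residents each)
rosters <- lapply(chosen_villages, function(v) paste0(v, "_resident_", 1:200))
tested  <- unlist(lapply(rosters, sample, size = 10))
length(tested)  # 15 villages x 10 residents = 150 individuals to test for malaria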
This page titled 1.2.6: Observational Studies and Sampling Strategies is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. 1.5: Observational Studies and Sampling Strategies by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel is licensed CC BY-SA 3.0. Original source: https://www.openintro.org/book/os.

1.2.7: Experiments
Studies where the researchers assign treatments to cases are called experiments. When this assignment includes randomization, e.g. using a coin flip to decide which treatment a patient receives, it is called a randomized experiment. Randomized experiments are fundamentally important when trying to show a causal connection between two variables.

Principles of experimental design
Randomized experiments are generally built on four principles.

Controlling. Researchers assign treatments to cases, and they do their best to control any other differences in the groups. For example, when patients take a drug in pill form, some patients take the pill with only a sip of water while others may have it with an entire glass of water. To control for water consumption, a doctor may ask all patients to drink a 12 ounce glass of water with the pill.

Randomization. Researchers randomize patients into treatment groups to account for variables that cannot be controlled. For example, some patients may be more susceptible to a disease than others due to their dietary habits. Randomizing patients into the treatment or control group helps even out such differences, and it also prevents accidental bias from entering the study.

Replication. The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. Additionally, a group of scientists may replicate an entire study to verify an earlier finding.

Blocking. Researchers sometimes know or suspect that variables, other than the treatment, influence the response. Under these circumstances, they may first group individuals based on this variable into blocks and then randomize cases within each block to the treatment groups. This strategy is often referred to as blocking. For instance, if we are looking at the effect of a drug on heart attacks, we might first split patients in the study into low-risk and high-risk blocks, then randomly assign half the patients from each block to the control group and the other half to the treatment group, as shown in Figure 1.16. This strategy ensures each treatment group has an equal number of low-risk and high-risk patients.

It is important to incorporate the first three experimental design principles into any study, and this book describes applicable methods for analyzing data from such experiments. Blocking is a slightly more advanced technique, and statistical methods in this book may be extended to analyze data collected using blocking.
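The blocked random assignment described above can be sketched in R as follows. The patient IDs, block sizes, and group labels are made up for this illustration; the point is only that the randomization happens separately within each risk block.

set.seed(42)
# a toy study: 20 low-risk and 20 high-risk patients
patients <- data.frame(id = 1:40,
                       risk = rep(c("low-risk", "high-risk"), each = 20))
patients$group <- NA
for (blk in unique(patients$risk)) {
  idx <- which(patients$risk == blk)
  # within each block, randomly assign half the patients to control and half to treatment
  patients$group[idx] <- sample(rep(c("control", "treatment"), length.out = length(idx)))
}
table(patients$risk, patients$group)  # each block contributes equally to both groups

Randomizing within blocks, rather than across the whole sample at once, is what guarantees the equal representation of low-risk and high-risk patients in each treatment group.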
Reducing bias in human experiments
Randomized experiments are the gold standard for data collection, but they do not ensure an unbiased perspective into the cause and effect relationships in all cases. Human studies are perfect examples where bias can unintentionally arise. Here we reconsider a study where a new drug was used to treat heart attack patients.17 In particular, researchers wanted to know if the drug reduced deaths in patients.

17 Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.

Figure 1.16: Blocking using a variable depicting patient risk. Patients are first divided into low-risk and high-risk blocks, then each block is evenly divided into the treatment groups using randomization. This strategy ensures an equal representation of patients in each treatment group from both the low-risk and high-risk categories.

These researchers designed a randomized experiment because they wanted to draw causal conclusions about the drug's effect. Study volunteers18 were randomly placed into tw