BMS 511 Statistical Analysis - Chapter 2 PDF
Marian University
2018
Guang Xu
Summary
This document is a chapter from a statistics course discussing numerical descriptors, measures of center (mean and median), and measures of spread (quartiles and standard deviation). The chapter includes examples and the process for identifying outliers.
Full Transcript
BMS 511 Statistical Analysis
Chapter 2: Numerical descriptors

Guang Xu, PhD, MPH
Assistant Professor of Biostatistics and Public Health
College of Osteopathic Medicine, Marian University

Previous Learning Objectives
Determine & Apply
Picturing Distributions with Graphs
– Individuals and variables
– Two types of data: categorical and quantitative
– Ways to chart categorical data: bar graphs and pie charts
– Ways to chart quantitative data: histograms and dotplots
– Interpreting histograms
– Graphing time series: time plots

Copyright © 2018 W. H. Freeman and Company

Learning Objectives
Describing distributions with numbers
– Measures of center: mean and median
– Measures of spread: quartiles and standard deviation
– The five-number summary and boxplots
– IQR and outliers
– Dealing with outliers
– Choosing among summary statistics
– Organizing a statistical problem

Measure of center: the mean
The mean, or arithmetic average
To calculate the average (mean) of a data set, add all values, then divide by the number of individuals. It is the “center of mass.”

Measure of center: the median (1 of 2)
The median is the midpoint of a distribution: the number such that half of the observations are smaller, and half are larger.

Measure of center: the median (2 of 2)
1) Sort observations from smallest to largest (n = number of observations).
2) The location of the median is (n + 1)/2 in the sorted list.
– If n is odd, the median is the value of the center observation. Example: n = 25, (n + 1)/2 = 13, median = 3.4.
– If n is even, the median is the mean of the two center observations. Example: n = 24, (n + 1)/2 = 12.5, median = (3.3 + 3.4)/2 = 3.35.

Comparing the mean and the median (1 of 2)
The median is a measure of center that is resistant to skew and outliers. The mean is not.
[Figure: mean and median for a symmetric distribution]

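
As a rough illustration of the two measures of center described above, the following Python sketch computes the mean and applies the (n + 1)/2 location rule for the median. The function names and the small data set are made up for illustration and do not come from the slides. The last two lines show how one extreme value pulls the mean upward while the median barely moves.

```python
def mean(values):
    """Add all values, then divide by the number of individuals."""
    return sum(values) / len(values)


def median(values):
    """Median via the (n + 1) / 2 location rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position in the sorted list
    if n % 2 == 1:
        # n odd: the location is a whole number, so take the center observation
        return ordered[int(location) - 1]
    # n even: the location ends in .5, so average the two center observations
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


# Hypothetical data, not taken from the slides
data = [2, 3, 3, 4, 5]
with_outlier = data + [40]

print(mean(data), median(data))                  # mean 3.4, median 3
print(mean(with_outlier), median(with_outlier))  # mean 9.5, median 3.5
```
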
Comparing the mean and the median (2 of 2)
[Figure: mean and median for skewed distributions]

Example: mean and median (1 of 2)
A study of freely forming groups in taverns all over Europe recorded the size (number of individuals) of groups that were naturally laughing. There were 501 groups in the study.
The median laughter group size is
A) 2
B) 2.5
C) 3
D) 3.5
E) 4

Example: mean and median (2 of 2)
The average laughter group size is
A) smaller than the median.
B) about the same as the median.
C) larger than the median.

Measure of spread: quartiles
The first quartile, Q1, is the median of the values below the median in the sorted data set.
The third quartile, Q3, is the median of the values above the median in the sorted data set.

Example: quartiles (1 of 2)
How fast do skin wounds heal? Here are the skin healing rates of 18 newts, measured in micrometers per hour:
28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35
Sorted data:
11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40
Median = ???
Quartiles = ???

Example: quartiles (2 of 2)
With n = 18, the median is at location (18 + 1)/2 = 9.5, so the median = (26 + 27)/2 = 26.5.
Q1 is the median of the 9 values below the median: Q1 = 22.
Q3 is the median of the 9 values above the median: Q3 = 33.

Measure of spread: interquartile range
The interquartile range (IQR) is the distance between the first and third quartiles.
Because the quartiles are themselves medians (of each half of the data set), the IQR is a resistant statistic.
The IQR can equal zero, if Q3 and Q1 are equal.
The quartiles, the median, the minimum, and the maximum are called the five-number summary.

Measure of spread: standard deviation (1 of 2)
The standard deviation is used to describe the variation around the mean.
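
To make "variation around the mean" concrete, here is a minimal Python sketch that computes a sample variance and takes its square root, reusing the newt healing rates from the quartiles example. It assumes the usual n − 1 divisor for a sample; the function names sample_variance and sample_sd are hypothetical and chosen only for illustration. The result is cross-checked against statistics.stdev from the Python standard library.

```python
import math
import statistics


def sample_variance(values):
    """Squared deviations from the mean, summed, divided by n - 1 (assumed sample convention)."""
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Newt skin healing rates (micrometers per hour) from the quartiles example
newts = [28, 12, 23, 14, 40, 18, 22, 33, 26, 27, 29, 11, 35, 30, 34, 22, 23, 35]

print(round(sample_sd(newts), 2))
# Cross-check against the standard library's sample standard deviation
print(round(statistics.stdev(newts), 2))
```

The standard deviation comes out in the same units as the data (micrometers per hour), while the variance is in squared units.
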
To get the standard deviation of a sample of data:
1) Calculate the variance: s² = Σ(x_i − x̄)² / (n − 1)

Measure of spread: standard deviation (2 of 2)
2) Take the square root to get the standard deviation, s.
Learn how to obtain the standard deviation of a sample using technology.

Example: calculating the standard deviation (1 of 2)
A person’s metabolic rate is the rate at which the body consumes energy. Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).

Example: calculating the standard deviation (2 of 2)
[Worked calculation shown on the slide; not captured in the transcript]

Features of the standard deviation
– s measures spread about the mean, and should only be used when the mean is the measure of center.
– s is always zero or greater. s = 0 only when all the values in the sample are identical.
– s has the same units of measurement as the original observations.
– s², the variance, has squared units of the original observations, and is harder to interpret.
– s, like the mean, is not resistant. Outliers have an even larger effect on s than they do on the mean.

Graphical displays: boxplots
The boxplot is a graphical view of the five-number summary.

IQR and suspected outliers
Recall that the interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot).
IQR = Q3 – Q1
An outlier is an individual value that falls outside the overall pattern.
How far outside the overall pattern does a value have to fall to be considered a suspected outlier?
– Suspected low outlier: any value < Q1 – 1.5 × IQR
– Suspected high outlier: any value > Q3 + 1.5 × IQR

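
The five-number summary and the 1.5 × IQR rule above can be sketched in Python as follows. This is an illustrative sketch, not the course's prescribed software workflow: the function names are hypothetical, and the quartiles follow the "median of each half" definition given earlier, so results may differ slightly from the quartile methods built into statistical packages. The data are the newt healing rates from the quartiles example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of each half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations below the median's location
    upper = ordered[half + n % 2:]  # observations above the median's location
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


def suspected_outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Newt skin healing rates (micrometers per hour) from the quartiles example
newts = [28, 12, 23, 14, 40, 18, 22, 33, 26, 27, 29, 11, 35, 30, 34, 22, 23, 35]

print(five_number_summary(newts))  # (11, 22, 26.5, 33, 40)
print(suspected_outliers(newts))   # [] since the fences are 5.5 and 49.5
```

For the newt data the IQR is 33 – 22 = 11, so no observation falls outside the fences and no value is flagged.
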
Example 1: using IQR to identify outliers
Individual #25 has a survival of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years. Individual #25 is a suspected outlier.

Example 2: using IQR to identify outliers
Anonymous class survey: weight (lbs) and height (in) were used to compute BMI.

Dealing with outliers
What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:
– Human error in recording information
– Human error in experimentation or data collection
– Unexplainable but apparently legitimate wild observations
Are you interested in ALL individuals?
Are you interested only in typical individuals?
Don’t discard outliers just to make your data look better, and don’t act as if they did not exist.

Choosing among summary statistics
Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. Plot the mean and use the standard deviation for error bars.
Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.

Example 1: choosing summary statistics
Deep-sea sediments
Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right-skew.
Which of these two values is the mean and which is the median?
– 0.015 and 0.009 grams per square meter of bottom surface
Which would be a better summary statistic for these data?

Example 2: choosing summary statistics (1 of 3)
Researchers grafted human cancerous cells onto 20 healthy adult mice. Then 10 of the mice were injected with tumor-specific antibodies (anti-CD47), while the other 10 mice were not (IgG). Here is what a table of the raw data would look like.
What summary statistics would you use for each of these two variables?

Example 2: choosing summary statistics (2 of 3)
Mouse  Treatment  Presence of metastases  Number of metastases
1      IgG        yes                     1
2      IgG        yes                     1
3      IgG        yes                     2
4      IgG        yes                     2
5      IgG        yes                     2
6      IgG        yes                     3
7      IgG        yes                     3
8      IgG        yes                     3
9      IgG        yes                     3
10     IgG        yes                     4
11     anti-CD47  no                      0
12     anti-CD47  no                      0
13     anti-CD47  no                      0

Example 2: choosing summary statistics (3 of 3)
Mouse  Treatment  Presence of metastases  Number of metastases
14     anti-CD47  no                      0
15     anti-CD47  no                      0
16     anti-CD47  no                      0
17     anti-CD47  no                      0
18     anti-CD47  no                      0
19     anti-CD47  no                      0
20     anti-CD47  yes                     1

Organizing a statistical problem
1. State: What is the practical question, in the context of a real-world setting?
2. Plan: What specific statistical operations does this problem call for?
3. Solve: Make the graphs and carry out the calculations needed for this problem.
4. Conclude: Give your practical conclusion in the real-world setting.

Statistics Software
SPSS 28.0 (IBM SPSS)
– Available on computers in the computer labs on Evan Center 2F and 3F, and in the Library
– Use is similar to Excel

Software: SPSS
– The software was first released in 1968 as the Statistical Package for the Social Sciences (SPSS).
– From version 16.0, the same version runs under Windows, Mac, and Linux. The graphical user interface is written in Java.
– IBM announced its acquisition of SPSS Inc. on July 28, 2009.
– Because of a dispute about ownership of the name "SPSS", between 2009 and 2010 the product was referred to as PASW (Predictive Analytics SoftWare).
– The complete transfer of business to IBM was done by October 1, 2010.

Statistics Software
GraphPad 7.04
– GraphPad Prism combines 2D scientific graphing, biostatistics with explanations, and curve fitting via nonlinear regression (Windows and Mac).
  Marian has a few licenses.
– GraphPad InStat guides students and scientists through basic biostatistics (Windows).
– GraphPad StatMate performs power and sample size calculations (Windows).
– GraphPad QuickCalcs are a set of statistical calculators (free, web-based).

Statistics Software: GraphPad
– GraphPad Prism
  Marian is in the process of purchasing a very limited number of licenses (~2-10 licenses).
  A student license is available on its website: $100/year.
– GraphPad InStat guides students and scientists through basic biostatistics (Windows, old Mac).
  No student license is available, but it offers an academic license.

Software: GraphPad Prism
[Figure: example graphs made with GraphPad Prism. Panels: CD4+ cells in CD3+ splenocytes (%) at D6, D12, and D18 for Control, Sub-Lethal, and Lethal groups; Neurovirulence of Zika in newborn CD1 mice (percent survival vs. day post infection for PBS, Clone, and Dakar at 10, 100, and 1000 FFU); Liver pathology score for uninfected and infected WT and CD8-/- mice]

Learning Objectives
Describing distributions with numbers
– Measures of center: mean and median
– Measures of spread: quartiles and standard deviation
– The five-number summary and boxplots
– IQR and outliers
– Dealing with outliers
– Choosing among summary statistics
– Organizing a statistical problem