Biostatistics Lesson 1 PDF - Humanitas University
Humanitas University
2023
Stefanos Bonovas
Summary
This document is a lesson on biostatistics, specifically on summarizing data. The lesson covers organizing data, types of variables (nominal, ordinal, interval, and ratio), frequency distributions, and properties of these distributions. It also touches on central location, spread, and skewness.
Full Transcript
Stefanos Bonovas, MD, MSc, PhD
Associate Professor of Medical Statistics
Department of Biomedical Sciences, Humanitas University
Course: Biostatistics
Lesson 1. March 28, 2023
Introduction to biostatistics: summarizing data

Organizing Data
Are you conducting a routine surveillance? An outbreak investigation? An epidemiological study? You must first compile information in an organized manner. One common method is to create a database. The next table is a simple database (a line-listing) from an epidemiologic investigation of a cluster of hepatitis A cases. It is organized like a spreadsheet with rows and columns.
Each row is called an observation (or record) and represents one person. Each column is called a variable and contains information about one characteristic of the individuals (e.g., race, gender, or date of birth). The first column or variable of a database usually contains the person's name, initials, or identification number. Other columns might contain demographic information, clinical details, exposures possibly related to illness, etc.
A variable can be any characteristic that differs from person to person (e.g., height, gender, exposure or disease status, physical activity pattern, etc.). The value of a variable is the number or descriptor that applies to a particular person (e.g., 168 cm of height, female gender, active smoker, never vaccinated for varicella, etc.).

Types of Variables
For certain variables, the values are numeric; for others, the values are descriptive. The type of value influences the way in which a variable can be summarized and analyzed. Variables can be classified into one of four types, depending on the type of scale used to characterize their values.
1. A nominal-scale variable is one whose values are categories without any numerical ranking, such as country of residence.
In epidemiology, nominal variables with only two categories are very common: alive or dead, ill or well, vaccinated or unvaccinated, did or did not smoke, etc. A nominal variable with two mutually exclusive categories is sometimes called a dichotomous variable.
2. An ordinal-scale variable has values that can be ranked, but are not necessarily evenly spaced, such as stage of cancer…
3. An interval-scale variable is measured on a scale of equally spaced units, but without a true zero point, such as date of birth.
4. A ratio-scale variable is an interval variable with a true zero point, such as height in centimeters, systolic blood pressure in mm Hg, or duration of illness in days...
Nominal- and ordinal-scale variables are considered qualitative or categorical variables, whereas interval- and ratio-scale variables are considered quantitative or continuous variables.
* Sometimes the same variable can be measured using both a nominal scale and a ratio scale. For example, the tuberculin skin tests of a group of persons can be measured as “positive” or “negative” (nominal scale) or in millimeters of induration (ratio scale).

Frequency Distributions
Variables can be summarized into tables called frequency distributions.
A frequency distribution displays the values that a variable can take, and the number of persons or records with each value.

Properties of Frequency Distributions
The data in a frequency distribution can be graphed. We call this type of graph a histogram. Even a quick look at this graph reveals three features:
- where the distribution has its peak (central location),
- how widely dispersed it is on both sides of the peak (spread),
- whether it is more or less symmetrically distributed on the two sides of the peak (shape).

Central location
The clustering at a particular value is known as the central location or central tendency of a frequency distribution. The central location of a distribution is one of its most important properties.
Three measures of central location are often used in epidemiology: mean, median, mode. Depending on the shape of the frequency distribution, all measures of central location can be identical or different (see next page).

Spread
A second property of a frequency distribution is spread (also called variation or dispersion). Spread refers to the distribution out from a central value. Two measures of spread commonly used in epidemiology are range and standard deviation.

Shape
A third property of a frequency distribution is its shape. The graphs of the three theoretical frequency distributions in the previous figure were completely symmetrical. When a distribution is asymmetrical, it is more commonly referred to as skewed.
* Skewness refers to the tail, not the hump. So a distribution that is skewed to the left has a long left tail.
A distribution that has a central location to the left and a tail off to the right is said to be positively skewed or skewed to the right.
A distribution that has a central location to the right and a tail to the left is said to be negatively skewed or skewed to the left.

The Normal or Gaussian distribution
This distribution deserves special mention.
It is the classic symmetrical bell-shaped curve.

Measures of Central Location
A measure of central location is a single, usually central, value that best represents a distribution of data. Measures of central location include the mode, the median, and the mean. Selecting the best measure to use for a given distribution depends largely on two factors:
- the shape or skewness of the distribution, and
- the intended use of the measure.

Mode
The mode is the value that occurs most often in a set of data. It can be determined simply by tallying the number of times each value occurs.
0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4
What is the mode?

Median
The median is the middle value of a set of data that has been put into rank order. The median is the value that divides the data into two halves, with one half of the observations being smaller than the median value and the other half being larger. The median is also the 50th percentile of the distribution.
How to identify the median?
Arrange the observations into increasing or decreasing order.
Find the middle position of the distribution by using the formula: Middle position = (n + 1) / 2
If the number of observations (n) is odd, the middle position falls on a single observation (the median equals the value of that observation).
If the number of observations is even, the middle position falls between two observations (the median equals the average of these two values).

Arithmetic mean
The arithmetic mean is a more technical name for what is more commonly called the mean or average. The arithmetic mean is the best descriptive measure for data that are normally distributed. On the other hand, the mean is not the measure of choice for data that are severely skewed or have extreme values ("outliers") in one direction or another.
In such a case, the median is the measure of choice.

Selecting the appropriate measure
Measures of central location are single values that summarize the observed values of a distribution. The mode provides the most common value, the median provides the central value, the arithmetic mean provides the average value.
The mode and median are useful as descriptive measures. However, they are not often used for statistical evaluations. In contrast, the mean is not only a good descriptive measure, but it also has good statistical properties.
While the arithmetic mean is the measure of choice when data are normally distributed, the median is the measure of choice for data that are not normally distributed.
When epidemiological data tend not to be normally distributed, the median is often preferred.
The mean uses all the data. It is sensitive to outliers. The mode and median tend not to be affected by outliers.
In summary, the selection of the most appropriate measure requires judgment based on the characteristics of the data (e.g. normally distributed or skewed, with or without outliers) and the reason for calculating the measure (e.g. for descriptive or analytical purposes).

Measures of Spread
Measures of spread include:
- the range,
- the interquartile range (IQR), and
- the standard deviation (SD).
The range of a set of data is the difference between its largest (maximum) value and its smallest (minimum) value.
In the epidemiological community, the range is usually reported as “from (the minimum) to (the maximum)”, that is, as two numbers rather than one.
Percentiles divide the data in a distribution into 100 equal parts. The Pth percentile (P ranging from 0 to 100) is the value that has P percent of the observations falling at or below it. In other words, the 90th percentile has 90% of the observations at or below it. The median is the 50th percentile. The maximum value is the 100th percentile, because all values fall at or below the maximum.
Quartiles. Sometimes, epidemiologists group data into 4 equal parts, or quartiles. Each quartile includes 25% of the data. The cut-off for the first quartile is the 25th percentile. The cut-off for the second quartile is the 50th percentile, which is the median. The cut-off for the third quartile is the 75th percentile. And the cut-off for the fourth quartile is the 100th percentile, which is the maximum.
Interquartile range (IQR). The interquartile range is a measure of spread used most commonly with the median. It represents the central portion of the distribution, from the 25th percentile to the 75th percentile. The IQR includes the 2nd and 3rd quartiles of a distribution.
The IQR includes approximately one half of the observations.
Figure: presentation of asymmetrically distributed data using a box-plot.

Example: finding the IQR
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27
Find the position of the 1st and 3rd quartiles.
Position of Q1 = (n + 1) / 4 = (29 + 1) / 4 = 7.5
Position of Q3 = 3(n + 1) / 4 = 3 × (29 + 1) / 4 = 22.5
Thus, Q1 lies between the 7th and 8th observations, and Q3 lies between the 22nd and 23rd observations.
Identify the value of the 1st and 3rd quartiles (Q1 and Q3).
Q1 = (6 + 7) / 2 = 6.5
Q3 = (13 + 14) / 2 = 13.5
IQR = 6.5 to 13.5

Properties and uses of the IQR
The IQR is generally used in conjunction with the median. Together, they are useful for characterizing the central location and spread of any frequency distribution, but particularly those that are skewed (asymmetrical)…

Standard Deviation (SD)
The standard deviation is the measure of spread used most commonly with the arithmetic mean.
Method for calculation:
1. Calculate the arithmetic mean.
2. Subtract the mean from each observation.
3. Square each difference.
4. Sum the squared differences.
5. Divide the sum of the squared differences by n - 1.
6. Take the square root of the value obtained. The result is the standard deviation.

Properties and uses of SD
The SD is usually calculated only when the data are more-or-less normally distributed, i.e. the data fall into a bell-shaped curve.
For normally distributed data, the mean is the recommended measure of central location, and the SD is the recommended measure of spread.
For normally distributed data, approximately two-thirds (68.3%) of the data fall within one SD on either side of the mean; 95.5% of the data fall within two SDs of the mean; and 99.7% of the data fall within three SDs. Exactly 95.0% of the data fall within 1.96 SDs of the mean.
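As a rough illustration, the two hand calculations above (the quartile positions in the IQR example and the six-step standard deviation method) can be sketched in Python. The function names `value_at_position` and `sample_sd` are hypothetical. Averaging the two neighbouring observations for any fractional position is an assumption that matches the worked example; statistical software often interpolates differently, so library quartiles may not agree with these values. The standard deviation is computed on the same 29 observations purely to show the steps.

```python
import math
import statistics

# The 29 observations from the IQR example, already in rank order.
observations = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10,
                11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27]


def value_at_position(sorted_values, position):
    """Return the value at a 1-based position in sorted data.

    Assumption: a fractional position (e.g. 7.5) is resolved by averaging
    the two neighbouring observations, as in the worked example.
    """
    lower = math.floor(position)
    upper = math.ceil(position)
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


def sample_sd(values):
    """Standard deviation following the six steps of the method."""
    n = len(values)
    mean = sum(values) / n                        # step 1
    differences = [x - mean for x in values]      # step 2
    squared = [d ** 2 for d in differences]       # step 3
    total = sum(squared)                          # step 4
    variance = total / (n - 1)                    # step 5
    return math.sqrt(variance)                    # step 6


n = len(observations)
q1 = value_at_position(observations, (n + 1) / 4)      # position 7.5
q3 = value_at_position(observations, 3 * (n + 1) / 4)  # position 22.5
print("IQR =", q1, "to", q3)  # IQR = 6.5 to 13.5

# Cross-check the step-by-step SD against the standard library.
print(sample_sd(observations), statistics.stdev(observations))
```

Dividing by n - 1 rather than n is what `statistics.stdev` does as well, so the two printed values should agree up to floating-point rounding.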
Standard error (se) of the mean
The standard deviation (SD) is sometimes confused with another measure with a similar name, the standard error (se) of the mean. However, the two are not the same. The standard deviation describes variability in a set of data. The standard error of the mean refers to the variability we might expect in the means of repeated samples taken from the same population.
Method for calculation:
1. Calculate the standard deviation.
2. Divide the standard deviation by the square root of the number of observations (n).

Exercise
When the serum cholesterol levels of 4,462 men were measured, the mean cholesterol level was 213, with a standard deviation of 42. Calculate the standard error of the mean for the serum cholesterol level of the men studied.
Answer
Standard error = 42 divided by the square root of 4,462 = 0.629

Properties and uses of the standard error of the mean
The primary practical use of the standard error (se) of the mean is in calculating confidence intervals (confidence limits) around the mean.

Confidence limits (confidence interval)
Epidemiologists conduct studies not only to measure characteristics in the subjects studied, but also to make generalizations about the larger population from which these subjects came. This process is called inference.
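Returning to the cholesterol exercise above, the standard error can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the numbers are those given in the exercise.

```python
import math

sd = 42    # standard deviation of serum cholesterol, from the exercise
n = 4462   # number of men measured

# Standard error of the mean = SD divided by the square root of n
se = sd / math.sqrt(n)
print(round(se, 3))  # 0.629
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.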
A common way to indicate a measurement’s precision is by providing a confidence interval (CI). A narrow CI indicates high precision; a wide CI indicates low precision.
CIs are often calculated for the mean. CIs can also be calculated for some of the epidemiologic measures, such as a proportion, risk ratio, and odds ratio.
How to calculate a 95% confidence interval for a mean:
1. Calculate the mean and its standard error (SE).
2. Multiply the SE by 1.96.
3. Lower limit of the 95% CI = mean - 1.96 × SE
4. Upper limit of the 95% CI = mean + 1.96 × SE

Properties and uses of confidence intervals
The mean is not the only measure for which a CI can be calculated. CIs are also often calculated for proportions, rates, risk ratios, odds ratios, and other epidemiologic measures when the purpose is to draw inferences from a study to the population.
CIs for means, proportions, risk ratios, odds ratios, and other measures all are calculated using different formulas. Regardless of the measure, the interpretation of a CI is the same: the narrower the interval, the more precise the estimate…

Choosing the right measure of central location & spread
Measures of central location and spread are useful to summarize a distribution of data. However, not every measure of central location and spread is well suited to every set of data.

Bibliography: Principles of Epidemiology in Public Health Practice. An Introduction to Applied Epidemiology and Biostatistics. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention (CDC). https://www.cdc.gov/csels/dsepd/ss1978/SS1978.pdf