Module 1 Part 1: Basic Statistical and Biostatistical Terms
Summary
This document discusses basic statistical and biostatistical terms, including definitions of data, statistics, populations, samples, and census. It covers the process involved in a statistical study, including context, data source, and sampling methods. The content also explores analyzing data, practical significance, and definitions.
Full Transcript
Module 1
Part 1: Basic Statistical and Biostatistical Terms
Importance of Biostatistics and Health Statistics

Definitions

Data are collections of observations, such as measurements or survey responses. A single data value is called a "datum", a term rarely used. The term "data" is plural, so it is correct to say "data are..." not "data is...".

Statistics is the science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them.

A population is the complete collection of all measurements or data that are being considered. Typically, the population is the complete collection of data that we would like to make inferences about.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Example
In the journal article "Residential Carbon Monoxide Detector Failure Rates in the United States" (by Ryan and Arnold, American Journal of Public Health, Vol. 101, No. 10), it was stated that there are 38 million carbon monoxide detectors installed in the United States. When 30 of them were randomly selected and tested, it was found that 12 of them failed to provide an alarm in hazardous carbon monoxide conditions. In this case, the population and sample are as follows:
Population: All 38 million carbon monoxide detectors in the United States
Sample: The 30 carbon monoxide detectors that were selected and tested
The objective is to use the sample data as a basis for drawing a conclusion about the population of all carbon monoxide detectors, and methods of statistics are helpful in drawing such conclusions.

PROCESS INVOLVED IN A STATISTICAL STUDY

PART 1: Prepare
Context
  What do the data represent?
  What is the goal of the study?
Source of the Data
  Are the data from a source with a special interest so that there is pressure to obtain results that are favorable to the source?
Sampling Method
  Were the data collected in a way that is unbiased, or were the data collected in a way that is biased (such as a procedure in which respondents volunteer to participate)?

PART 2: Analyze
Graph the Data
Explore the Data
  Are there any outliers (numbers very far away from almost all the other data)?
  What important statistics summarize the data (such as the mean and standard deviation)?
  How are the data distributed?
  Are there missing data?
  Did many selected subjects refuse to respond?
Apply Statistical Methods
  Use technology to obtain results.

PART 3: Conclude
  Do the results have statistical significance?
  Do the results have practical significance?

DEFINITIONS
A voluntary response sample (or self-selected sample) is one in which the respondents themselves decide whether to be included.
Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5% or less.
Example:
  Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance.
  Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.
Practical significance: It is possible that some treatment or finding is effective, but common sense might suggest that the treatment or finding does not make enough of a difference to justify its use or to be practical (e.g., hilot and herbals).

ANALYZING DATA: POTENTIAL PITFALLS
Here are a few items that could cause problems when analyzing data.

Misleading Conclusions
  When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by the statistical analysis.
Sample Data Reported Instead of Measured
  When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results.
Loaded Questions
  If survey questions are not worded carefully, the results of a study can be misleading.
Order of Questions
  Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered.
Nonresponse
  A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable.
Percentages
  Some studies cite misleading or unclear percentages. Note that 100% of some quantity is all of it, but if there are references made to percentages that exceed 100%, such references are often not justified.

BASIC TYPES OF DATA
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.

EXAMPLE
There are 17,246,372 high school students in the United States. In a study of 8505 U.S. high school students 16 years of age or older, 44.5% of them said that they texted while driving at least once during the previous 30 days (based on data in "Texting While Driving and Other Risky Motor Vehicle Behaviors Among High School Students," by Olsen, Shults, Eaton, Pediatrics, Vol. 131, No. 6).
Parameter: The population size of all 17,246,372 high school students is a parameter, because it is the size of the entire population of all high school students in the United States. If we somehow knew the percentage of all 17,246,372 high school students who reported they had texted while driving, that percentage would also be a parameter.
Statistic: The value of 44.5% is a statistic, because it is based on the sample, not on the entire population.

QUANTITATIVE/CATEGORICAL
Quantitative (or numerical) data consist of numbers representing counts or measurements.
Categorical (or qualitative or attribute) data consist of names or labels (not numbers that represent counts or measurements).
EXAMPLES
Quantitative Data: The ages (in years) of subjects enrolled in a clinical trial
Categorical Data as Labels: The genders (male/female) of subjects enrolled in a clinical trial
Categorical Data as Numbers: The identification numbers 1, 2, 3, ..., 25 are assigned randomly to the 25 subjects in a clinical trial. Those numbers are substitutes for names. They do not measure or count anything, so they are categorical data, not quantitative data.

DISCRETE/CONTINUOUS
Discrete data result when the data values are quantitative, and the number of values is finite or "countable." (If there are infinitely many values, the collection of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting tails or the number of births in Houston before getting a male.)
Continuous (numerical) data result from infinitely many possible quantitative values, where the collection of values is not countable. (That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as the lengths of distances from 0 cm to 12 cm.)
EXAMPLES
Discrete Data of the Finite Type: Each of several physicians plans to count the number of physical examinations given during the next full week. The data are discrete data because they are finite numbers, such as 27 and 46, that result from a counting process.
Discrete Data of the Infinite Type: Researchers plan to test the accuracy of a blood typing test by repeating the process of submitting a sample of the same blood (Type O+) until the test yields an error. It is possible that each researcher could repeat this test forever without ever getting an error, but they can still count the number of tests as they proceed. The collection of the numbers of tests is countable, because you can count them, even though the counting could go on forever.
Continuous Data: When the typical patient has blood drawn as part of a routine examination, the volume of blood drawn is between 0 mL and 50 mL. There are infinitely many values between 0 mL and 50 mL. Because it is impossible to count the number of different possible values on such a continuous scale, these amounts are continuous data.

Lesson 2: Basic Statistical and Biostatistical Terms

LEVELS OF MEASUREMENT

NOMINAL LEVEL
The nominal level of measurement is characterized by data that consist of names, labels, or categories only. It is not possible to arrange the data in some order (such as low to high).
EXAMPLES:
Yes/No/Undecided: Survey responses of yes, no, and undecided
Coded Survey Responses: For an item on a survey, respondents are given a choice of possible answers, and they are coded as follows: "I agree" is coded as 1; "I disagree" is coded as 2; "I don't care" is coded as 3; "I refuse to answer" is coded as 4; "Go away and stop bothering me" is coded as 5. The numbers 1, 2, 3, 4, 5 don't measure or count anything.

ORDINAL LEVEL (in order)
Data are at the ordinal level of measurement if they can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
EXAMPLE:
Course Grades: A biostatistics professor assigns grades of A, B, C, D, or F. These grades can be arranged in order, but we can't determine differences between the grades. For example, we know that A is higher than B (so there is an ordering), but we cannot subtract B from A (so the difference cannot be found).

INTERVAL LEVEL
Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful; but data at this level do not have a natural zero starting point at which none of the quantity is present.
EXAMPLES:
Temperatures: Body temperatures of 98.2°F and 98.8°F are examples of data at this interval level of measurement. Those values are ordered, and we can determine their difference of 0.6°F. However, there is no natural starting point. The value of 0°F might seem like a starting point, but it is arbitrary and does not represent the total absence of heat.
Years: The years 1492 and 1776 can be arranged in order, and the difference of 284 years can be found and is meaningful. However, time did not begin in the year 0, so the year 0 is arbitrary instead of being a natural zero starting point representing "no time."

RATIO LEVEL
Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point (where zero indicates that none of the quantity is present). For data at this level, differences and ratios are both meaningful.
EXAMPLES:
Heights of Students: Heights of 180 cm and 90 cm for a high school student and a preschool student (0 cm represents no height, and 180 cm is twice as tall as 90 cm.)
Class Times: The times of 50 min and 100 min for a statistics class (0 min represents no class time, and 100 min is twice as long as 50 min.)

BIG DATA
Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.
Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as biology and epidemiology).

MISSING DATA
A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value.
A data value is missing not at random if the missing value is related to the reason that it is missing.

Different Methods of Correcting Missing Data
Delete Cases: One very common method for dealing with missing data is to delete all subjects having any missing values.
Impute Missing Values: We impute missing data values when we substitute values for them. There are different methods of determining the replacement values, such as using the mean of the other values, or using a randomly selected value from other similar cases, or using a method based on regression analysis.

BASICS OF DESIGN OF EXPERIMENTS
The Gold Standard: Randomization with placebo/treatment groups is sometimes called the "gold standard" because it is so effective. (A placebo such as a sugar pill has no medicinal effect.)

In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. (The individuals in experiments are called experimental units, and they are often called subjects when they are people.)

In an observational study, we observe and measure specific characteristics, but we do not attempt to modify the individuals being studied.

EXAMPLES
Observational Study: Observe past data to conclude that ice cream causes drownings (based on data showing that increases in ice cream sales are associated with increases in drownings). The mistake is to miss the lurking variable of temperature and the failure to see that as the temperature increases, ice cream sales increase and drownings increase because more people swim.
Experiment: Conduct an experiment with one group treated with ice cream while another group gets no ice cream. We would see that the rate of drowning victims is about the same in both groups, so ice cream consumption has no effect on drownings.
Here, the experiment is clearly better than the observational study.

Design of Experiments
Replication: The repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments.
Blinding: Used when the subject doesn't know whether he or she is receiving a treatment or placebo.
Randomization: Used when individuals are assigned to different groups through a process of random selection.

Collecting Sample Data
A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen. (A simple random sample is often called a random sample, but strictly speaking, a random sample has the weaker requirement that all members of the population have the same chance of being selected. That distinction is not so important in this text.)
In systematic sampling, we select some starting point and then select every kth (such as every 50th) element in the population.
With convenience sampling, we simply use data that are very easy to get.
In purposive sampling, we deliberately select data that serve a specific purpose of the study.
In stratified sampling, we subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender). Then we draw a sample from each subgroup (stratum).
In cluster sampling, we first divide the population area into sections (or clusters). Then we randomly select some of those clusters and choose all the members from those selected clusters.
In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.

OBSERVATIONAL STUDIES
In a cross-sectional study, data are observed, measured, and collected at one point in time, not over a period of time. (Present)
In a retrospective (or case-control) study, data are collected from a past time period by going back in time (through examination of records, interviews, and so on). (Past)
In a prospective (or longitudinal or cohort) study, data are collected in the future from groups that share common factors (such groups are called cohorts). (Future)

Experiments
In a study, confounding occurs when we can see some effect, but we can't identify the specific factor that caused it.
Completely Randomized Experimental Design: Assign subjects to different treatment groups through a process of random selection.
Randomized Block Design: A block is a group of subjects that are similar, but blocks differ in ways that might affect the outcome of the experiment. Use the following procedure: form blocks (or groups) of subjects with similar characteristics, and randomly assign treatments to the subjects within each block.
Matched Pairs Design: Compare two treatment groups (such as treatment and placebo) by using subjects matched in pairs that are somehow related or have similar characteristics.
Rigorously Controlled Design: Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors.

Sampling Errors
A sampling error (or random sampling error) occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations.
A nonsampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.
A nonrandom sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

Lesson 3: Importance of Biostatistics and Health Statistics

What is Biostatistics?
Biostatistics is the branch of statistics responsible for interpreting the scientific data that are generated in the health sciences, including the public health sphere. It is the responsibility of biostatisticians and other experts to consider the variables in subjects (in public health, subjects are usually patients, communities, or populations), to understand them, and to make sense of different sources of variation.
The goal of biostatistics is to disentangle the data received and make valid inferences that can be used to solve problems in public health. Biostatistics applies statistical methods to conduct research in the areas of biology, public health, and medicine. Many times, experts in biostatistics collaborate with other scientists and researchers.

The Role of Biostatisticians
Biostatisticians are said to be the specialists of data evaluation, as it is their expertise that allows them to take complex, mathematical findings of clinical trials and research-related data and translate them into valuable information that is used to make public health decisions. The work of biostatisticians is also required in government agencies and legislative offices, where research is often used to influence change at the policy-making level.

In short, these professionals use mathematics to enhance science and bridge the gap between theory and practice.

Biostatisticians are required to develop statistical methods for clinical trials, observational studies, longitudinal studies, and genomics:
Clinical trials: Studying the evaluation of treatments, screening, and prevention methods in populations
Epidemiological: Studying the causes and origins of disease in humans
Human Genetics: Studying the genetic differences associated with diseases and disease states
Genomics: Studying the biological activity of genes as they relate to diseases and treatments
Spatial Studies: Studying the geographical distribution of disease/risk factors

Although the work of these scientists is complex, their responsibilities include:
Designing and conducting experiments related to health, emergency management, and safety
Collecting and analyzing data to improve current public health programs and identify problems and solutions in the public health sector
Interpreting the results of their findings
The validity of their research results depends on how well they can make meaningful generalizations and how well they can reproduce and apply experimental methods.

What is Informatics?
Informatics, which is actually an emerging field, is also known as bioinformatics, a science that relies on the basic disciplines of science, mathematics, probability and statistics, and computer science to build a solid statistical foundation for making advances, improvements, and even breakthroughs in public health and medicine.

Health informatics is often said to meet at the intersection of information science, computer science, and healthcare, as it deals with the resources, devices, and methods required for the effective storage, use, and retrieval of information, while public health informatics includes the application of informatics in public health areas, such as surveillance, prevention, preparedness, and health promotion. Public health informatics focuses on information and technology issues from the perspective of groups of individuals.

Naturally, health informatics tools would include computers, making systems analysts important members of public health informatics research teams. It is the responsibility of expert informaticists to systematically apply information, computer science, and technology to research, learning, and the practice of public health.

The Role of Systems Analysts in Informatics
Systems analysts are called upon to write and troubleshoot the software used by biostatisticians and researchers. Their work may also include conducting their own research, designing databases, and developing algorithms for processing and analyzing information.

The main responsibilities of systems analysts in biostatistics and informatics include:
Incorporating bioinformatics/biostatistics into efficient and automated data analysis tools
Developing and tracking quality workflow metrics for detecting variants and sequences
Working with scientists and researchers to develop project plan
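
Worked illustration: statistical significance in the births example
As a supplement to the statistical significance example in Part 1 (98 girls versus 52 girls in 100 births), the following Python sketch applies the 5% criterion stated there. It is only an illustration under stated assumptions, not a method given in the module: it assumes that girls and boys are equally likely (p = 0.5), that births are independent, and that "likelihood of the event" means the chance of getting at least that many girls. The function name upper_tail_probability and the constant ALPHA are hypothetical names chosen for this sketch.

```python
from math import comb

ALPHA = 0.05  # the "5% or less" criterion quoted in the module


def upper_tail_probability(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    Assumption (not from the module): each birth is independently
    a girl with probability p, so the count of girls is binomial.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    for girls in (98, 52):
        prob = upper_tail_probability(girls, 100)
        if prob <= ALPHA:
            verdict = "statistically significant"
        else:
            verdict = "not statistically significant"
        print(f"{girls} girls in 100 births: P(at least {girls}) = {prob:.3g} ({verdict})")
```

Under these assumptions, the chance of at least 98 girls is vanishingly small (far below 5%), while the chance of at least 52 girls is roughly 0.38, which matches the module's conclusions that the first outcome is statistically significant and the second is not.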