Module 1 Part 1: Basic Statistical and Biostatistical Terms PDF


Module 1
Part 1: Basic Statistical and Biostatistical Terms
Importance of Biostatistics and Health Statistics

Definitions

Data are collections of observations, such as measurements or survey responses. A single data value is called a "datum", a term rarely used. The term "data" is plural, so it is correct to say "data are..." not "data is..."

Statistics is the science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them.

A population is the complete collection of all measurements or data that are being considered. Typically, the population is the complete collection of data that we would like to make inferences about.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Example: In the journal article "Residential Carbon Monoxide Detector Failure Rates in the United States" (by Ryan and Arnold, American Journal of Public Health, Vol. 101, No. 10), it was stated that there are 38 million carbon monoxide detectors installed in the United States. When 30 of them were randomly selected and tested, it was found that 12 of them failed to provide an alarm in hazardous carbon monoxide conditions. In this case, the population and sample are as follows:
Population: All 38 million carbon monoxide detectors in the United States
Sample: The 30 carbon monoxide detectors that were selected and tested
The objective is to use the sample data as a basis for drawing a conclusion about the population of all carbon monoxide detectors, and methods of statistics are helpful in drawing such conclusions.

PROCESS INVOLVED IN A STATISTICAL STUDY

PART 1: Prepare
Context: What do the data represent? What is the goal of the study?
Source of the Data: Are the data from a source with a special interest so that there is pressure to obtain results that are favorable to the source?
Sampling Method: Were the data collected in a way that is unbiased, or were the data collected in a way that is biased (such as a procedure in which respondents volunteer to participate)?

PART 2: Analyze
Graph the Data
Explore the Data: Are there any outliers (numbers very far away from almost all the other data)? What important statistics summarize the data (such as the mean and standard deviation)? How are the data distributed? Are there missing data? Did many selected subjects refuse to respond?
Apply Statistical Methods: Use technology to obtain results.

PART 3: Conclude
Do the results have statistical significance? Do the results have practical significance?

DEFINITIONS

A voluntary response sample (or self-selected sample) is one in which respondents themselves decide whether to be included.

Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5% or less.
Example: Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.

Practical significance: It is possible that some treatment or finding is effective, but common sense might suggest that the treatment or finding does not make enough of a difference to justify its use or to be practical (e.g., hilot and herbals).

ANALYZING DATA: POTENTIAL PITFALLS

Here are a few items that could cause problems when analyzing data.

Misleading Conclusions: When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by the statistical analysis.

Sample Data Reported Instead of Measured: When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results.

Loaded Questions: If survey questions are not worded carefully, the results of a study can be misleading.

Order of Questions: Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered.

Nonresponse: A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable.

Percentages: Some studies cite misleading or unclear percentages. Note that 100% of some quantity is all of it, but if there are references made to percentages that exceed 100%, such references are often not justified.

BASIC TYPES OF DATA

A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.

EXAMPLE: There are 17,246,372 high school students in the United States. In a study of 8505 U.S. high school students 16 years of age or older, 44.5% of them said that they texted while driving at least once during the previous 30 days (based on data in "Texting While Driving and Other Risky Motor Vehicle Behaviors Among High School Students," by Olsen, Shults, Eaton, Pediatrics, Vol. 131, No. 6).
Parameter: The population size of all 17,246,372 high school students is a parameter, because it is the size of the entire population of all high school students in the United States. If we somehow knew the percentage of all 17,246,372 high school students who reported they had texted while driving, that percentage would also be a parameter.
Statistic: The value of 44.5% is a statistic, because it is based on the sample, not on the entire population.

QUANTITATIVE/CATEGORICAL

Quantitative (or numerical) data consist of numbers representing counts or measurements.
Categorical (or qualitative or attribute) data consist of names or labels (not numbers that represent counts or measurements).

EXAMPLES
Quantitative Data: The ages (in years) of subjects enrolled in a clinical trial
Categorical Data as Labels: The genders (male/female) of subjects enrolled in a clinical trial
Categorical Data as Numbers: The identification numbers 1, 2, 3, ..., 25 are assigned randomly to the 25 subjects in a clinical trial. Those numbers are substitutes for names. They do not measure or count anything, so they are categorical data, not quantitative data.

DISCRETE/CONTINUOUS

Discrete data result when the data values are quantitative, and the number of values is finite or "countable." (If there are infinitely many values, the collection of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting tails or the number of births in Houston before getting a male.)

Continuous (numerical) data result from infinitely many possible quantitative values, where the collection of values is not countable. (That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as the lengths of distances from 0 cm to 12 cm.)

EXAMPLES
Discrete Data of the Finite Type: Each of several physicians plans to count the number of physical examinations given during the next full week. The data are discrete data because they are finite numbers, such as 27 and 46, that result from a counting process.
Discrete Data of the Infinite Type: Researchers plan to test the accuracy of a blood typing test by repeating the process of submitting a sample of the same blood (Type O+) until the test yields an error. It is possible that each researcher could repeat this test forever without ever getting an error, but they can still count the number of tests as they proceed. The collection of the numbers of tests is countable, because you can count them, even though the counting could go on forever.
Continuous Data: When the typical patient has blood drawn as part of a routine examination, the volume of blood drawn is between 0 mL and 50 mL. There are infinitely many values between 0 mL and 50 mL. Because it is impossible to count the number of different possible values on such a continuous scale, these amounts are continuous data.

Lesson 2: Basic Statistical and Biostatistical Terms

LEVELS OF MEASUREMENT

NOMINAL LEVEL
The nominal level of measurement is characterized by data that consist of names, labels, or categories only. It is not possible to arrange the data in some order (such as low to high).
EXAMPLES:
Yes/No/Undecided: Survey responses of yes, no, and undecided
Coded Survey Responses: For an item on a survey, respondents are given a choice of possible answers, and they are coded as follows: "I agree" is coded as 1; "I disagree" is coded as 2; "I don't care" is coded as 3; "I refuse to answer" is coded as 4; "Go away and stop bothering me" is coded as 5. The numbers 1, 2, 3, 4, 5 don't measure or count anything.

ORDINAL LEVEL (in order)
Data are at the ordinal level of measurement if they can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
EXAMPLE:
Course Grades: A biostatistics professor assigns grades of A, B, C, D, or F. These grades can be arranged in order, but we can't determine differences between the grades. For example, we know that A is higher than B (so there is an ordering), but we cannot subtract B from A (so the difference cannot be found).

INTERVAL LEVEL
Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful; but data at this level do not have a natural zero starting point at which none of the quantity is present.
EXAMPLES:
Temperatures: Body temperatures of 98.2°F and 98.8°F are examples of data at this interval level of measurement. Those values are ordered, and we can determine their difference of 0.6°F. However, there is no natural starting point. The value of 0°F might seem like a starting point, but it is arbitrary and does not represent the total absence of heat.
Years: The years 1492 and 1776 can be arranged in order, and the difference of 284 years can be found and is meaningful. However, time did not begin in the year 0, so the year 0 is arbitrary instead of being a natural zero starting point representing "no time."

RATIO LEVEL
Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point (where zero indicates that none of the quantity is present). For data at this level, differences and ratios are both meaningful.
EXAMPLES:
Heights of Students: Heights of 180 cm and 90 cm for a high school student and a preschool student (0 cm represents no height, and 180 cm is twice as tall as 90 cm.)
Class Times: The times of 50 min and 100 min for a statistics class (0 min represents no class time, and 100 min is twice as long as 50 min.)

BIG DATA

Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.

Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as biology and epidemiology).

MISSING DATA

A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value.

A data value is missing not at random if the missing value is related to the reason that it is missing.

Different Methods of Correcting Missing Data
Delete Cases: One very common method for dealing with missing data is to delete all subjects having any missing values.
Impute Missing Values: We impute missing data values when we substitute values for them. There are different methods of determining the replacement values, such as using the mean of the other values, or using a randomly selected value from other similar cases, or using a method based on regression analysis.

BASICS OF DESIGN OF EXPERIMENTS

The Gold Standard: Randomization with placebo/treatment groups is sometimes called the "gold standard" because it is so effective. (A placebo such as a sugar pill has no medicinal effect.)

In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. (The individuals in experiments are called experimental units, and they are often called subjects when they are people.)

In an observational study, we observe and measure specific characteristics, but we do not attempt to modify the individuals being studied.

EXAMPLES
Observational Study: Observe past data to conclude that ice cream causes drownings (based on data showing that increases in ice cream sales are associated with increases in drownings). The mistake is to miss the lurking variable of temperature and to fail to see that as the temperature increases, ice cream sales increase and drownings increase because more people swim.
Experiment: Conduct an experiment with one group treated with ice cream while another group gets no ice cream. We would see that the rate of drowning victims is about the same in both groups, so ice cream consumption has no effect on drownings. Here, the experiment is clearly better than the observational study.

Design of Experiments
Replication: It is the repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments.
Blinding: It is used when the subject doesn't know whether he or she is receiving a treatment or placebo.
Randomization: It is used when individuals are assigned to different groups through a process of random selection.

Collecting Sample Data

A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen. (A simple random sample is often called a random sample, but strictly speaking, a random sample has the weaker requirement that all members of the population have the same chance of being selected. That distinction is not so important in this text.)

In systematic sampling, we select some starting point and then select every kth (such as every 50th) element in the population.

With convenience sampling, we simply use data that are very easy to get.

With purposive sampling, we use data that are selected to serve a specific purpose.

In stratified sampling, we subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender). Then we draw a sample from each subgroup (stratum).

In cluster sampling, we first divide the population area into sections (or clusters). Then we randomly select some of those clusters and choose all the members from those selected clusters.

In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.

OBSERVATIONAL STUDIES

In a cross-sectional study, data are observed, measured, and collected at one point in time, not over a period of time. (Present)

In a retrospective (or case-control) study, data are collected from a past time period by going back in time (through examination of records, interviews, and so on). (Past)

In a prospective (or longitudinal or cohort) study, data are collected in the future from groups that share common factors (such groups are called cohorts). (Future)

Experiments

In a study, confounding occurs when we can see some effect, but we can't identify the specific factor that caused it.

Completely Randomized Experimental Design: Assign subjects to different treatment groups through a process of random selection.

Randomized Block Design: A block is a group of subjects that are similar, but blocks differ in ways that might affect the outcome of the experiment. Use the following procedure: Form blocks (or groups) of subjects with similar characteristics, and randomly assign treatments to the subjects within each block.

Matched Pairs Design: Compare two treatment groups (such as treatment and placebo) by using subjects matched in pairs that are somehow related or have similar characteristics.

Rigorously Controlled Design: Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors.

Sampling Errors

A sampling error (or random sampling error) occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations.

A nonsampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.

A nonrandom sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

Lesson 3: Importance of Biostatistics and Health Statistics

What is Biostatistics?

Biostatistics is the branch of statistics responsible for interpreting the scientific data that are generated in the health sciences, including the public health sphere. It is the responsibility of biostatisticians and other experts to consider the variables in subjects (in public health, subjects are usually patients, communities, or populations), to understand them, and to make sense of different sources of variation. The goal of biostatistics is to disentangle the data received and make valid inferences that can be used to solve problems in public health.

Biostatistics applies statistical methods to research in the areas of biology, public health, and medicine. Many times, experts in biostatistics collaborate with other scientists and researchers.

The Role of Biostatisticians

Biostatisticians are said to be the specialists of data evaluation, as it is their expertise that allows them to take complex, mathematical findings of clinical trials and research-related data and translate them into valuable information that is used to make public health decisions. The work of biostatisticians is also required in government agencies and legislative offices, where research is often used to influence change at the policy-making level. In short, these professionals use mathematics to enhance science and bridge the gap between theory and practice.

Biostatisticians are required to develop statistical methods for clinical trials, observational studies, longitudinal studies, and genomics:
Clinical trials: Studying the evaluation of treatments, screening, and prevention methods in populations
Epidemiological studies: Studying the causes and origins of disease in humans
Human Genetics: Studying the genetic differences associated with diseases and disease states
Genomics: Studying the biological activity of genes as they relate to diseases and treatments
Spatial Studies: Studying the geographical distribution of disease/risk factors

Although the work of these scientists is complex, their responsibilities include:
Designing and conducting experiments related to health, emergency management, and safety
Collecting and analyzing data to improve current public health programs and identify problems and solutions in the public health sector
Interpreting the results of their findings

The validity of their research results depends on how well they can make meaningful generalizations and how well they can reproduce and apply experimental methods.

What is Informatics?

Informatics, which is actually an emerging field, is also known as bioinformatics, a science that relies on the basic disciplines of science, mathematics, analysis, probability and statistics, and computer science to build a solid statistical foundation for making advances, improvements, and even breakthroughs in public health and medicine.

Health informatics is often said to meet at the intersection of information science, computer science, and healthcare, as it deals with the resources, devices, and methods required for the effective storage, use, and retrieval of information, while public health informatics includes the application of informatics in public health areas, such as surveillance, prevention, preparedness, and health promotion. Public health informatics focuses on information and technology issues from the perspective of groups of individuals.

Naturally, health informatics tools would include computers, making systems analysts important members of public health informatics research teams. It is the responsibility of expert informaticists to systematically apply information, computer science, and technology to research, learning, and the practice of public health.

The Role of Systems Analysts in Informatics

Systems analysts are called upon to write and troubleshoot the software used by biostatisticians and researchers. Their work may also include conducting their own research, designing databases, and developing algorithms for processing and analyzing information. The main responsibilities of systems analysts in biostatistics and informatics include:
Incorporating bioinformatics/biostatistics into efficient and automated data tools
Developing and tracking quality workflow metrics for detecting variants and sequences
Working with scientists and researchers to develop project plans
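
Worked example (statistical significance): The 5% criterion described in this module can be checked numerically for the two births examples (98 girls and 52 girls in 100 births). The sketch below is illustrative only. It assumes that each birth is independently a girl with probability 0.5, which the module does not state, and the helper name prob_at_least is hypothetical.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials.

    Assumption: each trial (birth) succeeds (is a girl) with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of a result at least as extreme as each example, under pure chance.
p_98 = prob_at_least(98, 100)
p_52 = prob_at_least(52, 100)

for girls, prob in [(98, p_98), (52, p_52)]:
    verdict = "statistically significant" if prob <= 0.05 else "not statistically significant"
    print(f"{girls} or more girls in 100 births: probability {prob:.3g} ({verdict})")
```

Under these assumptions, 98 or more girls has a chance far below 5%, while 52 or more girls happens by chance well over 5% of the time, which matches the conclusions in the text.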
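
Worked example (systematic sampling): The description of systematic sampling in this module (choose a starting point, then take every kth element) can be sketched as follows. The population of 1,000 numbered subjects, the seed, and the function name systematic_sample are all hypothetical choices made for illustration. Picking the starting point at random within the first k elements is one common convention, not something the module specifies.

```python
import random

def systematic_sample(population, k, seed=None):
    """Select a start among the first k elements, then every kth element after it."""
    rng = random.Random(seed)      # seeded so the example is repeatable
    start = rng.randrange(k)       # assumed convention: random start in 0..k-1
    return population[start::k]

# Hypothetical population: subjects numbered 1 through 1000.
population = list(range(1, 1001))
sys_sample = systematic_sample(population, k=50, seed=1)

print(len(sys_sample), "subjects selected")
print("first few:", sys_sample[:3])
```

With 1,000 subjects and k = 50, the sample always contains 20 subjects spaced 50 apart, whatever the starting point.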
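
Worked example (missing data): The two approaches named in this module, deleting cases and imputing with the mean of the other values, can be compared on a small data set. The records below are invented for illustration, and the field names (age, sbp) and the helper impute_mean are hypothetical. This is a minimal sketch and not a recommendation of either method.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "sbp": 120},
    {"id": 2, "age": None, "sbp": 136},
    {"id": 3, "age": 50, "sbp": None},
    {"id": 4, "age": 45, "sbp": 128},
]

# Delete cases: keep only subjects with no missing values.
complete_cases = [r for r in records if all(v is not None for v in r.values())]

def impute_mean(rows, field):
    """Return copies of rows with missing `field` values replaced by the observed mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

# Impute missing values: apply mean imputation to each field in turn.
imputed = impute_mean(impute_mean(records, "age"), "sbp")

print("complete cases:", [r["id"] for r in complete_cases])
print("imputed:", imputed)
```

Deleting cases shrinks the data set from four subjects to two, while mean imputation keeps all four at the cost of substituting values that were never observed.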
