MD115 Biostatistics: Introduction to Biostatistical Concepts (PDF)

Document Details

European University Cyprus

Theodore Lytras

Tags

biostatistics data analysis medical research R programming

Summary

This document provides an introduction to biostatistics, specifically focusing on basic concepts, types of data, and formulating an analysis plan. It uses examples, including vaccine effectiveness, to illustrate the importance of statistical methods in medical research. The document also introduces the R programming language as a statistical tool.

Full Transcript

MD115 Biostatistics: 1. Introduction – Basic concepts and types of data; formulating a plan for analysis

Theodore Lytras, Assistant Professor of Public Health

Introduction

What is biostatistics?

Statistics: “A branch of applied mathematics which deals with the collection, classification, analysis and interpretation of data”

Biostatistics: “A branch of applied mathematics which deals with the collection, classification, analysis and interpretation of data from biomedical research”

“But I’m a medical doctor! Why should I learn biostatistics?”
“Because that’s exactly how medical knowledge is generated!”

How medical knowledge is generated

Science (and medicine in particular) is empirical: natural and experimental observations → inductive reasoning → generalization

Basic research ←→ Clinical research

Any new medical knowledge is validated in actual people.

For example: how can we know that the COVID-19 vaccine is effective against symptomatic COVID-19?

Do an experiment: randomize some people to receive either the vaccine OR placebo, follow them up over some time period, and compare how many of them get symptomatic COVID-19.

Use a study sample (the people randomized) to draw inferences about a population (everybody out there).

Sample → population

Can we say that the vaccine is effective, if:
- out of 10 people who got the vaccine, 3 got COVID-19, and out of 10 people who got placebo, 6 got COVID-19?
- out of 100 people who got the vaccine, 30 got COVID-19, and out of 100 people who got placebo, 60 got COVID-19?
- out of 10 people who got the vaccine, 1 got COVID-19, and out of 10 people who got placebo, 8 got COVID-19?

There are only two possibilities. Either:
1. The vaccine is effective in reducing your risk of getting COVID-19
2.
The vaccine is NOT effective in reducing your risk of getting COVID-19, and just by chance we got this difference

The larger the study sample, and/or the more extreme the difference, the more likely it is that the difference is real and not due to chance.

These are the kinds of questions statistics deals with!

Clinical research

Remember: we are working with a sample to infer about a population of interest.
- It is the population that we’re really interested in, not the sample
- We need to clearly distinguish between population and sample
- Sample quantities (e.g. sample mean) are known, i.e. measured, whereas population quantities (e.g. population mean) are unknown and are being estimated

Appropriate sample selection is essential, as is accurate data measurement; otherwise bias ensues. We will discuss bias next year, in Epidemiology class...

Assuming unbiased samples and accurate measurements, statistics allows us to:
- Convert our data into meaningful results (analyze our data)
- Know how likely our results are to reflect real differences, or to be due to chance (random error)

How is clinical research done?

1. Study design: variable types, type of analyses
2. Data collection: enter data in a database or spreadsheet
3. Data processing: logical and consistency checks, create derived variables, merge datasets, etc.
4. Data analysis: descriptive analyses, univariate / multivariate analyses, etc.
5. Output results: calculate result values, create tables and figures, etc.
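The three vaccine scenarios discussed earlier can be made concrete with a few lines of R. The sketch below is only an illustration: the counts are the ones from the slides, but the choice of Fisher's exact test (`fisher.test` from base R) is an assumption, since the lecture does not name a specific test at this point. The names `scenarios` and `p_values` are hypothetical.

```r
# Each scenario: cases among vaccinated, cases among placebo, and group size.
# The counts come from the slides; the choice of test is an assumption.
scenarios <- list(
  "3/10 vs 6/10"     = c(vaccine = 3,  placebo = 6,  n = 10),
  "30/100 vs 60/100" = c(vaccine = 30, placebo = 60, n = 100),
  "1/10 vs 8/10"     = c(vaccine = 1,  placebo = 8,  n = 10)
)

# For each scenario, build a 2x2 table (group by outcome) and extract the
# two-sided p-value of Fisher's exact test.
p_values <- sapply(scenarios, function(s) {
  tab <- matrix(
    c(s[["vaccine"]], s[["n"]] - s[["vaccine"]],
      s[["placebo"]], s[["n"]] - s[["placebo"]]),
    nrow = 2, byrow = TRUE,
    dimnames = list(group   = c("vaccine", "placebo"),
                    outcome = c("COVID-19", "no COVID-19"))
  )
  fisher.test(tab)$p.value
})

print(signif(p_values, 3))
```

Under these assumptions, the first scenario gives a p-value of roughly 0.37, so a difference of 3 versus 6 out of 10 is quite compatible with chance. The same proportions with 100 people per group, or the more extreme 1 versus 8 out of 10, give much smaller p-values. This matches the point made in the slides: larger samples and more extreme differences make chance a less plausible explanation.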
Steps 3-5 of the research workflow (data processing, data analysis, output of results) can be implemented using statistical software, such as the R statistical environment. This facilitates reproducible research.

Reproducible research: a basic principle of good research

Definition: ability to reproduce the results for:
- the investigator himself/herself
- other collaborating investigators
- the wider research community

A prerequisite is the use of code throughout:
- Processing raw data
- Analyzing data and generating results
- Presenting results in a report

Direct link between analysis and final result. Nothing done “by hand”!

Interactive use vs R scripts

Rectangular data

The data we are working with are usually rectangular (tabular), like:

Patient ID   Age   Sex      Vaccinated?   Got COVID-19?
1            52    Female   Yes           No
2            65    Male     Yes           No
3            16    Female   No            Yes
4            62    Female   No            No
...          ...   ...      ...           ...

Unit of observation: the unit that is described by the data. Usually the patient.

Observation (Record): the rows of the table. A set of values that refer to a particular unit of observation.

Variable (Field): the columns of the table. A set of values of the same type that reflect a particular characteristic of the units of observation. Has a name.

Primary key: the observation ID. A variable that uniquely defines a unit of observation.

Types of variables

Categorical:
- Nominal (without an inherent ordering), e.g. sex, occupation
- Ordinal (with inherent ordering), e.g. educational level
- Dichotomous (with just two levels), e.g. diseased/healthy, yes/no

Numeric:
- Continuous (measurements), e.g. temperature
- Discrete (counts), e.g. number of children

Different statistical methods are appropriate for different types of data!

Types of variables: examples

Nominal variables: categorical variables with NO inherent ordering
- Blood type
- Occupation
- Sex
- Race
- ...

Types of variables: examples

Ordinal variables
- Likert scales (e.g.
satisfaction level, socioeconomic status, etc.): Very poor / Poor / Fair / Good / Very good
- Educational level: primary school / high school / college / postgraduate

Categorical variables with an inherent ordering

Types of variables: examples

Continuous variables
- Weight (kg)
- Body Mass Index (kg/m²)
- Blood pressure (mmHg)
- Blood cholesterol (mg/dl)
- Survival time (years)
- ...

Measurements (non-countable), always accompanied by some unit of measurement. Can be converted to another unit of measurement.

Types of variables: examples

Discrete variables
- Number of children
- Number of deaths
- Number of asthma attacks
- ...

Counts of things. No units of measurement.

Types of variables

One can convert between variable types (categorical variables: nominal, ordinal; numeric variables: continuous, discrete).

Examples

Age to age group:

[...]

    > class(a)
    [1] "numeric"

NA is a special value in R that represents a missing value.

    > a <- c(1, 2, NA, 4)
    > a
    [1]  1  2 NA  4
    > is.na(a)
    [1] FALSE FALSE  TRUE FALSE

Vector indexing

Indexing = selecting elements from a vector or other object.

Use the indexing operator [ ], in one of four ways:
1. Positive integer vector: vector positions of the elements we “keep”
2. Negative integer vector: vector positions of the elements that we exclude
3. Logical vector: of equal length to the main vector (or recycled); positions with TRUE we keep, positions with FALSE we exclude
4. Character vector: the names of our data vector (if defined)

Indexing is ubiquitous in R!

Loading datasets into R (reading them into data.frames)

R can import any kind of data (CSV files, other statistical packages, databases, Excel worksheets, etc.).

Probably the simplest way: from Excel files

    > install.packages("readxl")
    > library(readxl)
    > dat
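The four indexing styles and the behaviour of NA described above can be tried on a small vector. This is a minimal sketch: the vector `age`, its element names `p1` to `p4`, and the age-group cut points are all hypothetical. The use of `cut()` for the age-to-age-group conversion is likewise an assumption, since the original example code for that step is not preserved in the transcript.

```r
# Hypothetical ages, named by hypothetical patient labels; one value is missing.
age <- c(p1 = 52, p2 = 65, p3 = NA, p4 = 62)

first_two <- age[c(1, 2)]        # 1. positive integers: keep positions 1 and 2
without_3 <- age[-3]             # 2. negative integer: exclude position 3
observed  <- age[!is.na(age)]    # 3. logical vector: keep non-missing values
by_name   <- age[c("p1", "p4")]  # 4. character vector: select by name

# Missing values propagate through calculations unless removed explicitly.
mean_with_na    <- mean(age)                # NA, because one age is missing
mean_without_na <- mean(age, na.rm = TRUE)  # mean of the three observed ages

# Converting a continuous variable (age) to an ordinal one (age group).
# The cut points are arbitrary and chosen only for illustration.
age_group <- cut(age,
                 breaks = c(0, 18, 65, Inf),
                 right  = FALSE,  # intervals are [0,18), [18,65), [65,Inf)
                 labels = c("0-17", "18-64", "65+"))

print(observed)
print(mean_without_na)
print(age_group)
```

Note that the logical index `!is.na(age)` and the negative index `-3` select the same elements here, but the logical form keeps working if the position of the missing value changes, which is usually what a script needs.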
