BSPH123 Biostatistics Introduction and Descriptive Biostatistics January 2024 PDF
Document Details
University of Lusaka
2024
Prof. Eustarckio Kazonga, PhD
Summary
This document is a course outline for a biostatistics course, covering topics like introduction to biostatistics, descriptive statistics, measures of central tendency, and measures of dispersion. It has examples of qualitative and quantitative measurements, and discusses the role of biostatistics in clinical medicine and public health.
Full Transcript
BSPH123 – BIOSTATISTICS
Prof. Eustarckio Kazonga, PhD
30th January 2024
E-mail: [email protected]

Presentation Outline
- Course Outline
- Key Definitions
- Uses of Biostatistics
- Variables and Scales of Measurements
- Descriptive Statistics

Course Outline
3.1 Introduction
    3.1.1 Introduction to Biostatistics
    3.1.2 Uses of Biostatistics
    3.1.3 Types of Variables
3.2 Descriptive Statistics
    3.2.1 Frequency Tables
    3.2.2 Graphs and Histograms
    3.2.3 Bar charts and Pie Charts
    3.2.4 Shapes of Frequency Distributions
3.3 Measures of Central Tendency
    3.3.1 Mean, Median and Mode
    3.3.2 Selection of Appropriate Measures of Central Tendency
3.4 Measures of Dispersion
    3.4.1 Inter-quartile range
    3.4.2 Degrees of Freedom
    3.4.3 Variance and Standard Deviation
4. Introduction to Probability Theory
5. Normal Distribution
    5.1 Characteristics of a Normal Distribution
    5.2 Uses and applications
    5.3 Standard score
6. Experimental Designs
7. Sampling Designs
8. Design of data collection and sampling instruments
9. Data collection in the field
10. Qualitative and quantitative methods of data analysis
11. Basic statistics computing
    11.1 Introduction to computers
    11.2 Statistical software: SPSS, SAS, Epi-Info, STATA
    11.3 Data entry using Epi-data
    11.4 Data processing using SPSS
    11.5 Use of other computer software
12. Analyse data using statistical software, interpret outputs and present the results
    12.1 Standardised Normal Deviate (SND) Z-test
        12.1.1 Z-test for one sample
        12.1.2 Confidence interval for population mean
        12.1.3 Z-test for two samples
        12.1.4 Confidence interval for difference between two population means
    12.2 Student t-test
        12.2.1 T-test for one sample
        12.2.2 Confidence interval for one sample
        12.2.3 T-test for two independent samples
        12.2.4 Confidence interval for difference of two means
    12.3 Correlation Coefficient and Simple Linear Regression
        12.3.1 Measures of correlation
        12.3.2 Interpretation of correlation coefficient
        12.3.3 Linear regression
        12.3.4 Interpretation of regression coefficient
        12.3.4 Confidence interval for the slope
        12.3.5 Assumptions for Biostatistical testing
        12.3.6 Hypotheses testing and confidence intervals
        12.3.7 Tests of significance and post-hoc: ANOVA
13. General linear models
14. Survival analysis
15. Non-parametric tests
16. Surveys and sampling

ASSESSMENT
- Continuous Assessment (Mid-Semester Examination and Practical Assignment): 30%
- Final Examinations: 70%

1. Wayne W. Daniel & Chad L. Cross (2013). Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition. John Wiley & Sons, Inc. ISBN 978-1-118-30279-8
2. Bernard Rosner (2016). Fundamentals of Biostatistics, 8th edition. Cengage Learning, Boston. ISBN 978-1-305-26892-0

1. Chap T. Le (2003). Introductory Biostatistics. John Wiley & Sons, Inc., Hoboken, New Jersey.
2. Geoffrey R. Norman & David L. Streiner (2014). Biostatistics: The Bare Essentials, 4th edition. People's Medical Publishing House USA Ltd. ISBN-13: 978-1607951780, ISBN-10: 1607951789

Definition of Biostatistics
- Bio means involving life or living organisms.
- Statistics is a science that deals with the collection of data, and then organising, summarising, presenting, analysing, interpreting, and drawing conclusions.
- Therefore, biostatistics means statistics applied to life, health sciences or medical sciences.

USES OF BIOSTATISTICS
- Identifying health trends that lead to life-saving measures through the application of statistical procedures, techniques, and methodology
- Monitoring and evaluating health programmes and policies
- Assurance: making certain that necessary services are provided to reach the desired goals determined by policy measures
- Assessment: identifying problems related to the health of populations and determining their extent
- Shaping the procedure of clinical trials

Role of Biostatistics in Clinical Medicine
- The main theory of biostatistics lies in the term variability. No two individuals are the same. For example, the blood pressure of a person may vary from time to time as well as from person to person. There can also be instrumental variability as well as observer variability.
- Biostatistical methods try to quantify the uncertainties present in medical science.
- They help the researcher to arrive at a scientific judgment about a hypothesis.

Role of Statistics in Public Health and Community Medicine
If reliable information regarding the disease is available, the public health administrator is in a position to:
- Assess community needs
- Understand socio-economic determinants of health
- Plan experiments in health research
- Analyse their results
- Study diagnosis and prognosis of the disease for taking effective action
- Scientifically test the efficacy of new medicines and methods of treatment

Why do we need to study Medical Statistics?
Three reasons:
(1) Basic requirement of medical research.
(2) Update your medical knowledge.
(3) Data management and treatment.
Role of Biostatisticians
- To guide the design of an experiment or survey prior to data collection
- To analyse data using proper statistical procedures and techniques
- To present and interpret the results to researchers and other decision makers

Key Biostatistics Concepts
Population: the largest collection of values of a random variable in which we have an interest at a particular time. A measurement obtained from a population is a PARAMETER.
For example: the weights of all the children enrolled in a certain elementary school.
Populations may be finite or infinite.
Sample: a part of a population. A measurement obtained from a sample is known as a STATISTIC.
For example: the weights of only a fraction of these children.

Census
- The count of a given population (or other phenomenon of interest) and the recording of its characteristics, done at a specific point in time and usually at regular intervals by a government entity (or any other entity) for the geographic area or subareas under its domain.
- Examples include the Population Census, Housing Census, Agriculture Census, etc. (ZSA, 2018).

Descriptive and Inferential Statistics
- Descriptive statistics deal with the enumeration, organization and graphical representation of data from a sample.
- Inferential statistics deal with reaching conclusions from incomplete information, that is, generalizing from the specific sample.
- Inferential statistics use the available information in a sample to draw inferences about the population from which the sample was drawn.

Variable, Value and Observation
- Observation: the unit upon which measurements are made, e.g., a person, place, or thing
- Variable: the [generic] characteristic being measured, e.g., AGE, HIV status
- Value: a realised measurement, e.g., an age of "27", a "positive" HIV test

Variable
- A characteristic that takes different values.

Measurement
- Measurement ≡ the assigning of numbers and codes according to prior-set rules (Stevens, 1946).
- Three main types of measurements:
    - Categorical (nominal)
    - Ordinal
    - Quantitative (scale)

Variables and Scales of Measurement

Types of Data
1. Categorical (e.g. sex, marital status, income category)
2. Continuous (e.g. age, income, weight, height, time to achieve an outcome)
3. Discrete (e.g. number of children in a family)
4. Binary or dichotomous (e.g. response to all 'Yes' or 'No' type of questions)

1. Qualitative
1.1 Nominal (categorical), e.g. sex, race, religion
1.2 Ordinal
- The ordinal scale is used to arrange (or rank) items into a sequence ranging from the highest to the lowest.
- Grade: A+, A, B+, B, C+, C, D, F
- 1st, 2nd, 3rd, etc.

2. Quantitative
2.1 Interval
- Constant size interval between adjacent units
- Interval refers to the third level of measurement in relation to the complexity of statistical techniques used to analyse data.
- It is quantitative in nature.
- The individual units are equidistant from one point to the other.
- Interval data do not have an absolute zero, e.g. temperature measured in Celsius or Fahrenheit.

2.2 Ratio
- Constant size interval between adjacent units
- True zero starting point (ratios have meaning)
- Equal distances between the increments
- This scale has an absolute zero.
- Ratio variables exhibit the characteristics of ordinal and interval measurement, e.g. variables like time, length and weight are on ratio scales and can also be measured using a nominal or ordinal scale.
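The difference between interval and ratio scales can be illustrated with a short Python sketch. The temperatures and weights below are made-up illustrative values, and the helper names `celsius_to_kelvin` and `kg_to_grams` are hypothetical. The sketch shows that a ratio of two Celsius readings changes when the unit changes (no true zero), while a ratio of two weights does not.

```python
def celsius_to_kelvin(celsius):
    """Shift a Celsius reading onto the Kelvin scale (which has a true zero)."""
    return celsius + 273.15


def kg_to_grams(kg):
    """Rescale a weight from kilograms to grams (zero stays zero)."""
    return kg * 1000.0


# Interval scale: the ratio 20 °C / 10 °C is not a meaningful "twice as hot".
ratio_celsius = 20.0 / 10.0
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences, however, are preserved on an interval scale.
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

# Ratio scale: 80 kg is twice 40 kg in any unit.
ratio_kg = 80.0 / 40.0
ratio_g = kg_to_grams(80.0) / kg_to_grams(40.0)

print(f"Celsius ratio: {ratio_celsius:.3f}, Kelvin ratio: {ratio_kelvin:.3f}")
print(f"Celsius difference: {diff_celsius:.1f}, Kelvin difference: {diff_kelvin:.1f}")
print(f"kg ratio: {ratio_kg:.3f}, gram ratio: {ratio_g:.3f}")
```

The Celsius ratio is 2 but the same two temperatures in Kelvin give a ratio close to 1.04, which is why "ratios have meaning" only on a scale with a true zero.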
Categorical Measurements
Classify observations into named categories.
Examples:
- HIV status (positive or negative)
- SEX (male or female)
- BLOOD PRESSURE classified as hypotensive, normotensive, borderline hypertensive, or hypertensive

Ordinal Measurements
Categories that can be put in rank order.
Examples:
- STAGE OF CANCER classified as stage I, stage II, stage III, stage IV
- OPINION classified as strongly agree (5), agree (4), neutral (3), disagree (2), strongly disagree (1); the so-called Likert scale

Sources of Public Health Data
- Medical Records
- Ministries of Health and Home Affairs: Records of Births and Deaths and Annual Reports
- Health Surveys
- Zambia Demographic and Health Surveys (ZDHS)
- Living Conditions Monitoring Survey
- National Census of Population and Housing

Lecture 2
Descriptive Biostatistics
February 2024
1. Measures of Central Tendency
2. Measures of Dispersion
3. Tables
4. Charts and Diagrams

1. Measures of Central Tendency
- A measure of central tendency, or central location, is a descriptive statistic that describes the average or typical value of a set of scores.
- A statistic of central tendency tells you where the middle of a set of measurements is.
- The arithmetic mean is by far the most common, but the median, geometric mean, and harmonic mean are sometimes useful.
- There are three common measures of central tendency: the mode, the median and the mean. The geometric mean and the harmonic mean are also used.

The Mean
- While the arithmetic mean is by far the most commonly used statistic of central tendency, you should be aware of a few others.
- The arithmetic mean is the sum of the observations divided by the number of observations.
- It is the most common statistic of central tendency, and when someone says simply "the mean" or "the average," this is what they mean.
- The arithmetic mean works well for values that fit the normal distribution.
- It is sensitive to extreme values, which makes it work poorly for data that are highly skewed.

Sample Mean
For grouped data:
$$\bar{x} = \frac{\sum f x}{\sum f}$$
where x refers to the mid-points of the groups and f refers to the frequency of each group (i.e. the numbers associated with each group).
$$\bar{x} = \frac{180}{26} = 6.92 \text{ (2 d.p.)}$$

Class Interval   Frequency (f)
1-5              1
6-10             3
11-15            8
16-20            10
21-25            20
26-30            15
31-35            12
36-40            5
41-45            4
46-50            2
TOTAL            80

Class Interval   Midpoint (x)   Frequency (f)   fx
1-5              3              1               3
6-10             8              3               24
11-15            13             8               104
16-20            18             10              180
21-25            23             20              460
26-30            28             15              420
31-35            33             12              396
36-40            38             5               190
41-45            43             4               172
46-50            48             2               96
TOTAL                           80              2045

$$\bar{x} = \frac{2045}{80} = 25.56$$

Characteristics of Mean
- Uniqueness: for a given set of data there is one and only one mean.
- Simplicity: the mean is easy to calculate.
- Affected by extreme values: the mean is influenced by each value. Therefore, extreme values can distort the mean.

When To Use the Mode
- The mode is not a very useful measure of central tendency.
- It is insensitive to large changes in the data set; that is, two data sets that are very different from each other can have the same mode.

For grouped data:
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$$
where
- L is the lower class limit of the modal class
- f1 is the frequency of the modal class
- f0 is the frequency of the class before the modal class in the frequency table
- f2 is the frequency of the class after the modal class in the frequency table
- h is the class interval of the modal class

Number     Frequency
1-3        7
4-6        6
7-9        4
10-12      2
13-15      2
16-18      8
19-21      1
22-24      2
25-27      3
28-30      2
TOTAL      37

Example
- Question: Calculate the mode for the given distribution.
- Answer:
    - Modal class = 16-18
    - L = 16
    - f1 = 8
    - f0 = 2
    - f2 = 1
    - h = 3
    - $\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 16 + \frac{8 - 2}{2(8) - 2 - 1} \times 3 = 16 + \frac{18}{13} \approx 17.38$

Number     Frequency   Cumulative Frequency
1-3        7           7
4-6        6           13
7-9        4           17
10-12      2           19
13-15      2           21
16-18      8           29
19-21      1           30
22-24      2           32
25-27      3           35
28-30      2           37

Appropriate Measures Of Central Tendency
- For example, suppose that your automobile trip has four 10 km segments. You drive your car at:
    - 100 km/hr for the first 10 km
    - 110 km/hr for the second 10 km
    - 90 km/hr for the third 10 km
    - 120 km/hr for the fourth 10 km
- What is the average speed? Because each segment covers the same distance, the appropriate average is the harmonic mean of the four speeds, about 103.8 km/hr, rather than the arithmetic mean of 105 km/hr.
- In symmetrical distributions, the median and mean are equal.
- For normal distributions, mean = median = mode.
- In positively skewed distributions, the mean is greater than the median.
- In negatively skewed distributions, the mean is smaller than the median.

Measures of Non-Central Locations
- Quartiles
- Quintiles
- Deciles
- Percentiles

2. Measures of Dispersion
- A measure of dispersion conveys information regarding the amount of variability present in a set of data.
- Note:
    1. If all the values are the same → there is no dispersion.
    2. If all the values are different → there is dispersion:
        a) If the values are close to each other → the amount of dispersion is small.
        b) If the values are widely scattered → the dispersion is greater.

Measures of dispersion are:
1. Range (R)
2. Variance (S²)
3. Standard deviation (S)
4. Coefficient of variation (C.V)

1. The Range (R):
Range = Largest value − Smallest value = $x_L - x_S$
- Note: the range depends on only two values.
- Example
- Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50
- Find the range.
- Range = 66 − 38 = 28

2. The Variance:
- It measures dispersion relative to the scatter of the values about their mean.
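The worked examples above (grouped mode, average trip speed, and the range and spread of the ten ages) can be checked with a short Python sketch. This is an illustration only: `grouped_mode` is a hypothetical helper that assumes the usual grouped-data mode formula built from the symbols L, f1, f0, f2 and h, and the variance uses the standard library's sample variance with divisor n − 1.

```python
import statistics


def grouped_mode(lower_limit, f_modal, f_before, f_after, class_width):
    """Grouped-data mode: L + (f1 - f0) / (2*f1 - f0 - f2) * h (assumed formula)."""
    return lower_limit + (f_modal - f_before) / (2 * f_modal - f_before - f_after) * class_width


# Modal class 16-18 from the frequency table: L=16, f1=8, f0=2, f2=1, h=3.
mode = grouped_mode(lower_limit=16, f_modal=8, f_before=2, f_after=1, class_width=3)

# Four equal 10 km segments: the average speed is the harmonic mean of the speeds.
speeds = [100, 110, 90, 120]
average_speed = statistics.harmonic_mean(speeds)
total_time_hours = sum(10 / s for s in speeds)   # cross-check via distance / time
speed_from_time = 40 / total_time_hours

# Ten ages from the range example.
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
age_range = max(ages) - min(ages)
sample_variance = statistics.variance(ages)      # divisor n - 1
sample_sd = statistics.stdev(ages)

print(f"Grouped mode:    {mode:.2f}")            # about 17.38
print(f"Average speed:   {average_speed:.2f}")   # about 103.80 km/hr
print(f"Range:           {age_range}")           # 28
print(f"Sample variance: {sample_variance}")     # 90
print(f"Sample SD:       {sample_sd:.2f}")       # about 9.49
```

The harmonic mean agrees with total distance divided by total time, which is the reasoning behind choosing it over the arithmetic mean for equal-distance segments.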
a) Sample Variance (S²):
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean.

- Example
- Find the sample variance of the ages 43, 66, 61, 64, 65, 38, 59, 57, 57, 50, where $\bar{x} = 56$.
- Solution:
$$S^2 = \frac{(43-56)^2 + (66-56)^2 + \dots + (50-56)^2}{10 - 1}$$

X      X − x̄    (X − x̄)²
43     -13      169
66     10       100
61     5        25
64     8        64
65     9        81
38     -18      324
59     3        9
57     1        1
57     1        1
50     -6       36
Total           810

Using the formula:
$$S^2 = \frac{810}{10 - 1} = 90$$
Variance (S²) = 90
∴ Standard Deviation (S) = $\sqrt{90} \approx 9.49$

Computational Formula
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$
where $\sigma^2$ is the population variance, X is a score, $\mu$ is the population mean, and N is the number of scores.

X      X²     X − μ    (X − μ)²
9      81     2        4
8      64     1        1
6      36     -1       1
5      25     -2       4
8      64     1        1
6      36     -1       1
Σ = 42 Σ = 306  Σ = 0  Σ = 12

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{12}{6} = 2$$
$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{306 - \frac{42^2}{6}}{6} = \frac{306 - 294}{6} = \frac{12}{6} = 2$$

Notation: $\bar{X}$ = arithmetic mean, n = sample size, $X_i$ = ith value of the variable X, f = frequency.

QUESTION 1
A sample of twenty-three animals used for clinical trials were weighed in grams. The following are the weights recorded:
56 19 24 45 17 15 34 36 29 43 35 52 19 46 18 19 32 31 54 27 39 21 18
Calculate the standard deviation, where $S = \sqrt{\text{Variance}} = \sqrt{S^2}$.

4. The Coefficient of Variation (C.V):
- It is a measure used to compare the dispersion in two sets of data, and it is independent of the unit of measurement.
$$C.V = \frac{S}{\bar{X}} (100)$$
where S is the sample standard deviation and $\bar{X}$ is the sample mean.
- Suppose two samples of human males yield the following data:

                     Sample 1        Sample 2
Age                  25-year-olds    11-year-olds
Mean weight          89 kg           56 kg
Standard deviation   5 kg            5 kg

- We wish to know which is more variable.
- Solution:
- C.V (Sample 1) = (5/89) × 100 = 5.62%
- C.V (Sample 2) = (5/56) × 100 = 8.93%
- Which sample is more variable?

Measure of Skewness
- Skewness is a measure of symmetry in the distribution of scores.
- Formula 1. Sk = (Mean − Mode)/S
- Formula 2. Sk = 3(Mean − Median)/S

Measure of Skewness
$$s_3 = \frac{\sum (X - \bar{X})^3 / N}{\left( \sqrt{\sum (X - \bar{X})^2 / N} \right)^{3}}$$

Kurtosis
- When the distribution is normally distributed, its kurtosis equals 3 and it is said to be mesokurtic.
- When the distribution is less spread out than normal, its kurtosis is greater than 3 and it is said to be leptokurtic.
- When the distribution is more spread out than normal, its kurtosis is less than 3 and it is said to be platykurtic.

Measure of Kurtosis
$$s_4 = \frac{\sum (X - \bar{X})^4 / N}{\left( \sum (X - \bar{X})^2 / N \right)^{2}}$$

s2, s3, & s4
- Collectively, the standard deviation (s), variance (s2), skewness (s3), and kurtosis (s4) describe the shape of the distribution.

Charts and Diagrams
- A picture is worth a thousand words.
- Embedding a chart, illustration, table, graph, map, photograph, or other non-textual element into your research paper can bring added clarity to a study because it provides a clean, concise way to report findings that would otherwise take several long [and boring to read] paragraphs to describe.
- Non-textual elements are useful tools for summarising information, especially when you have a great deal of data to present.
- Non-textual elements help the reader grasp a large amount of data quickly and in an orderly fashion.
- Non-textual elements are visually engaging.
- Using a chart or photograph, for example, can help enhance the overall presentation of your work and provide a way to stimulate a reader's interest in the study.

Diagram
- A drawing that illustrates or visually explains a thing or idea by outlining its component parts and the relationships among them.

Figure
- A form bounded by three or more lines; one or more digits or numerical symbols representing a number.

Graph
- A two-dimensional drawing showing a relationship [usually between two sets of numbers] by means of a line, curve, a series of bars, or other symbols.
- Typically, an independent variable is represented on the horizontal line (X-axis) and a dependent variable on the vertical line (Y-axis).
- The perpendicular axes intersect at a point called the origin, and are calibrated in the units of the quantities represented.

[Figure 1. Trends in HIV prevalence (%) among pregnant women in Country X, years 1–10 (1991–2000). Source: STD/AIDS Control Programme, Uganda (2001), HIV/AIDS Surveillance Report.]

Year    MMR
1960    50
1970    45
1980    26
1990    15
2000    12

Figure (1): Maternal mortality rate of (country), 1960-2000
Frequency Distribution Tables
- This is a method of presenting data and the number of occurrences. The table comprises the data values and their corresponding numbers of occurrences.

EXAMPLE
Prepare a frequency table with class sizes of 10 starting with "11 – 20".

Frequency Table
- Generally, the first approach to examining your data
- Identifies the distribution of variables overall
- Identifies potential outliers
    - Investigate outliers as possible data entry errors
    - Investigate a sample of others for data entry errors

A research study has been conducted examining the number of children in the families living in a community. The following data have been collected based on a random sample of n = 30 families from the community.
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6
Organize this data in a Frequency Table!

X = No. of Children   Count (Frequency)   Relative Freq.
0                     2                   2/30 = 0.067
1                     3                   3/30 = 0.100
2                     5                   5/30 = 0.167
3                     5                   5/30 = 0.167
4                     6                   6/30 = 0.200
5                     4                   4/30 = 0.133
6                     2                   2/30 = 0.067
7                     2                   2/30 = 0.067
8                     1                   1/30 = 0.033

Age (years)   Males       Females     Mid-point of interval
20-           3 (12%)     2 (10%)     (20+30)/2 = 25
30-           9 (36%)     6 (30%)     (30+40)/2 = 35
40-           7 (28%)     5 (25%)     (40+50)/2 = 45
50-           4 (16%)     3 (15%)     (50+60)/2 = 55
60-70         2 (8%)      4 (20%)     (60+70)/2 = 65
Total         25 (100%)   20 (100%)

Age      Sex: M    Sex: F    M-P
20-      (12%)     (10%)     25
30-      (36%)     (30%)     35
40-      (28%)     (25%)     45
50-      (16%)     (15%)     55
60-70    (8%)      (20%)     65

Figure (2): Distribution of 45 patients at (place), in (time) by age and sex
[Frequency curve: frequency (0 to 9) against age in years (20-, 30-, 40-, 50-, 60-69), with separate curves for females and males.]

- Histograms are a special form of bar chart where the data represent continuous rather than discrete categories.
- Therefore, the bars are connected.

[Composite or component bar chart: females and males, 2019 and 2020.]
[Pie chart, males: 14%, 36%, 22%, 28%.]

1 | 89
2 | 27999
3 | 05556678899
4 | 0457778899
5 | 000001122334444566666677899
6 | 00000222245566899
7 | 01122223335689
8 | 000111166788
9 | 00

Age (years)   Frequency
10            5
11            10
12            27
13            18
14            6
15            16
16            38
17            9

Age (years)   Frequency   Cumulative Frequency
10            5           5
11            10          5+10 = 15
12            27          15+27 = 42
13            18          42+18 = 60
14            6           60+6 = 66
15            16          66+16 = 82
16            38          82+38 = 120
17            9           120+9 = 129

Thank You!

2. Inferential Biostatistics
The Tools of Classical Inference:
a. Estimation
b. Confidence Intervals
c. P-values
d. Hypothesis Tests
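Returning to the frequency tables of this lecture, the table for the number of children in the 30 sampled families can be rebuilt with a short Python sketch. This is one possible approach using only the standard library; the variable names are illustrative, and a cumulative-frequency column is added in the same style as the age table.

```python
from collections import Counter
from itertools import accumulate

# Number of children in each of the n = 30 sampled families.
children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7,
            3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

n = len(children)
counts = Counter(children)

values = sorted(counts)                          # distinct values of X, ascending
frequencies = [counts[v] for v in values]        # count of each value
relative = [f / n for f in frequencies]          # relative frequency f / n
cumulative = list(accumulate(frequencies))       # running total of frequencies

print(f"{'X':>3} {'Freq':>5} {'Rel.':>6} {'Cum.':>5}")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:>3} {f:>5} {r:>6.3f} {c:>5}")
```

The last cumulative value equals the sample size, which is a quick check that no observation was dropped or double-counted.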