Summary

This document provides an introduction to biostatistics, describing what statistics is, its importance in medicine, and the types of variables used. It also covers different types of data and their presentation methods. It references learning objectives, sources of data, and different types of variables that include qualitative and quantitative variables, discrete and continuous.

Full Transcript


Biostatistics
Lecture One: Introduction to Biostatistics

Learning objectives of this session:
▪ What is meant by statistics?
▪ Importance of statistics in medicine
▪ Types of variables (qualitative and quantitative variables)
▪ What is meant by descriptive statistics and inferential statistics?
▪ Presentation of data

Statistics: the discipline concerned with the treatment (handling) of numerical data derived from groups of individuals. We therefore need to know how to obtain, how to analyse, and how to interpret this information, which is called data. Data are available in the form of numbers (values).

Biostatistics: the field of statistics in which the data being analysed are derived from the biological sciences and medicine.

There are two types of statistics:
▪ Descriptive statistics: concerned only with the collection, organisation, presentation and summarisation of data.
▪ Inferential (analytical) statistics: the objective is to reach a decision about a large group of data by examining only a small part of it.

Data (singular: datum): the raw material of statistics. Data are obtained either by measurement or by counting.

Value: the numerical representation of the measurement of a variable.

Sources of data: Statistical activity is motivated by the need to answer a question. This requires a search for suitable data to serve as the raw material for the investigation. Such data are usually available from one or more of the following sources:
1. Routine records, such as hospital medical records
2. Surveys, if the data needed to answer a question are not available from routine records
3. Experiments
4. External sources, in the form of published reports, data banks, or the research literature

Variable: any characteristic that can take different values on different occasions and in different places, persons, and times, e.g. height, weight, age, etc.
Variables are of two types:
▪ Quantitative (numerical) variable: a variable that can be measured in units, such as height, weight, age, etc.
▪ Qualitative (categorical) variable: a variable that cannot be measured in units. It can only be summarised as a number (count) or percentage in each category, e.g. sex, ethnic group, colour of the eye, race, education, occupation, type of disease.

Quantitative variables are of two types:
▪ Discrete quantitative variable: characterised by gaps or interruptions in the values it can take. These gaps indicate that no values exist between the permitted values, e.g. daily admissions of patients to hospital, parity, number of abortions, etc.
▪ Continuous quantitative (random) variable: does not possess such gaps or interruptions. It has fractions of units, and the variable can assume any value within a specified interval, e.g. height, weight, etc. In fact, most biological data are of the continuous quantitative type.

Measurements and measurement scales: Variables can also be classified according to the measurement scale used.
Measurement means the assignment of numbers to objects or events according to a set of rules. These rules give the following scales:
▪ Nominal scale (e.g. male-female, well-sick, under 65 years-65 and above, child-adult, married-unmarried)
▪ Ordinal scale (e.g. high-intermediate-low; non-smoker, light, moderate, heavy smoker; social class I, II, III, IV and V)
▪ Interval scale
▪ Ratio scale (allows the equality of ratios to be determined)

Population: the largest collection of entities in which we have an interest at a particular time, sharing at least one characteristic in common.

Sample: a part (subset) of the population, chosen so as to be as representative of the population as possible. A sample may be random or non-random. The method applied to collect a sample is called sampling.

Lectures Two & Three: Summarisation and presentation of data

Data organisation (ordered array): the arrangement of the data according to their magnitude, from the smallest to the largest or vice versa. The benefits of an ordered array are:
▪ It shows the smallest value (Xs) and the largest value (Xl)
▪ It gives the range
▪ It makes the data easy to present in a table
▪ It makes the median easy to find

Data presentation: Data are presented by one of the following:
1. Numerical presentation (numbers)
2. Tables:
   a. Master table
   b. Simple frequency distribution table
   c. Class interval frequency distribution table
3. Graphs (pictorial presentation of data)

When the sample is small (n ≤ 20), the data are easy to present numerically ("simple data"). When there are more than 20 values or observations, it is better to present them in tables.

Master table: contains the information on all variables included in the study (for example, a spreadsheet in Excel). The information on one or two variables is taken from the master table and presented in a simple frequency table or another type of table.
Simple frequency distribution table: the arrangement of data according to their magnitude together with the frequency of occurrence of each magnitude. A complete table has several columns: the values of the variable (X), the frequency of occurrence (F), the cumulative frequency (Cum.F), the relative frequency (R.F.) or percent (%), and the cumulative relative frequency (C.R.F.), as in the following example:

Table (1): The parity distribution of mothers attending ANC clinic in the Al-karar PHCC for the year 2024
Parity | Frequency | Cum.F | R.F. or % | C.R.F.
Primigravida (0) | 25 | 25 | 0.25 | 0.25
1 | 14 | 39 | 0.14 | 0.39
2 | 16 | 55 | 0.16 | 0.55
3 | 18 | 73 | 0.18 | 0.73
4 & more | 27 | 100 | 0.27 | 1.00
Total | 100 | - | 1.00 | -

The characteristics of tables:
▪ A table should be simple, easy to understand and self-explanatory.
▪ Each table should have a number.
▪ Each table should have a title written at the top. This title should answer the following questions: what, where, when and who.
▪ Each table should have a clear heading for each column.
▪ Each table should contain a total at the end of each column.
▪ Abbreviations and codes should be avoided; if they must be used, they should be explained at the bottom of the table.
▪ If any number is taken from a reference or book, the source should be cited at the bottom of the table.

Class-interval frequency distribution table: Data of the continuous or discrete quantitative type are presented here as grouped intervals. The steps to present the data in a class interval table are as follows:
▪ Count the number of observations.
▪ Determine the smallest and the largest values.
▪ Decide whether to present them in a simple or in a class interval table.
▪ To present them in a class interval table, determine the number of class intervals according to Sturges' formula: $K = 1 + 3.322 \log_{10} n$
▪ Then determine the width of the class interval: $W = \frac{\text{Range}}{K} = \frac{R}{K} = \frac{X_l - X_s}{K}$
▪ Then determine the class intervals.
▪ Then record the frequency of observations in each class interval by tallying.

The additional characteristics of class interval tables:
▪ The number of class intervals (K) should not be less than 5 (in order not to lose detail) and not more than 20.
▪ The preferable number of class intervals is 6-12, or the number given by Sturges' formula.
▪ The width of the class intervals is constant.
▪ There are no gaps between class intervals.
▪ There is no overlap between class intervals (each observation is presented once only).

Example:

Table 3: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2024
Haemoglobin (g/dL) | Tallying | Freq. | Cum.F | R.F. | C.R.F.
8- | I | 1 | 1 | 0.014 | 0.014
9- | III | 3 | 4 | 0.043 | 0.057
10- | IIIII IIIII IIII | 14 | 18 | 0.200 | 0.257
11- | IIIII IIIII IIIII IIII | 19 | 37 | 0.271 | 0.528
12- | IIIII IIIII IIII | 14 | 51 | 0.200 | 0.728
13- | IIIII IIIII III | 13 | 64 | 0.186 | 0.914
14- | IIIII | 5 | 69 | 0.071 | 0.985
15-15.9 | I | 1 | 70 | 0.014 | 1.00
Total | | 70 | - | 1.00 | -

The raw data: the haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2024
10.2 13.7 10.4 14.9 11.5 12.0 11.0 13.3 12.9 12.1
9.4 13.2 10.8 11.7 10.6 10.5 13.7 11.8 14.1 10.3
13.6 12.1 12.9 11.4 12.7 10.6 11.4 11.9 9.3 13.5
14.6 11.2 11.7 10.9 10.4 12.0 12.9 11.1 8.8 10.2
11.6 12.5 13.4 12.1 10.9 11.3 14.7 10.8 13.3 11.9
11.4 12.5 13.0 11.6 13.1 9.7 11.2 15.1 10.7 12.9
13.4 12.3 11.0 14.6 11.1 13.5 10.9 13.1 11.8 12.2

Steps for creating the table:
$K = 1 + 3.322 \log_{10} n = 1 + 3.322 \times \log_{10} 70 = 1 + 3.322 \times 1.85 = 1 + 6.15 = 7.15 \approx 7$
Width of class interval $= \frac{R}{K} = \frac{\text{Max} - \text{Min}}{K} = \frac{15.1 - 8.8}{7} = 0.9 \approx 1$

Table 3: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2024
Haemoglobin (g/dL) | Tallying | Freq. | Cum.F | R.F. | C.R.F.
8- | I | 1 | 1 | 0.014 | 0.014
9- | III | 3 | 4 | 0.043 | 0.057
10- | IIIII IIIII IIII | 14 | 18 | 0.200 | 0.257
11- | IIIII IIIII IIIII IIII | 19 | 37 | 0.271 | 0.528
12- | IIIII IIIII IIII | 14 | 51 | 0.200 | 0.728
13- | IIIII IIIII III | 13 | 64 | 0.186 | 0.914
14- | IIIII | 5 | 69 | 0.071 | 0.985
15-15.9 | I | 1 | 70 | 0.014 | 1.00
Total | | 70 | - | 1.00 | -

The Graphical Representation of Data

Tabular and graphical procedures:
▪ Qualitative data
   Tabular methods: frequency distribution, relative frequency distribution, percent frequency distribution, crosstabulation
   Graphical methods: bar graph, pie chart
▪ Quantitative data
   Tabular methods: frequency distribution, relative frequency distribution, cumulative frequency distribution, cumulative relative frequency distribution, stem-and-leaf display, crosstabulation
   Graphical methods: dot plot, histogram, ogive, scatter diagram

Types of graphs:

Bar chart: a graphic representation used to present data of the qualitative type. It is composed of a number of bars separated from each other. The width of the bars is not important in itself, but it is preferable for all bars to have the same width (to give a true impression). The length of each bar is important: it is drawn proportional to the frequency or percentage.

Table 4: The method of delivery of 600 babies born in Al-Shatra Hospital for the year 2024
Method of delivery | No. of births | Percentage
Normal vaginal delivery | 478 | 79.7 %
Forceps delivery | 65 | 10.8 %
Caesarean section | 57 | 9.5 %
Total | 600 | 100 %

[Figure 1: The method of delivery of 600 babies born in Al-Shatra Hospital for the year 2024. Bar chart of the number of births: Caesarean section 57, forceps delivery 65, normal vaginal delivery 478.]

[Figure 2: The method of delivery of 600 babies born in Al-Shatra Hospital for the year 2024. Bar chart of percentages: Caesarean section 9.5 %, forceps delivery 10.8 %, normal vaginal delivery 79.7 %.]

Pie chart: a graphic representation used to present data of the qualitative type in the shape of a circle. The angle of the slice for each category, in degrees, is $\frac{f}{n} \times 360$, where f is the frequency of the category and n is the total number of observations.
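The slice rule f/n × 360 can be tried on the delivery counts from Table 4. The following Python lines are only an illustrative sketch; the function name `pie_angles` and the dictionary layout are assumptions made here, not part of the lecture.

```python
# Delivery counts taken from Table 4 of the lecture.
deliveries = {
    "Normal vaginal delivery": 478,
    "Forceps delivery": 65,
    "Caesarean section": 57,
}


def pie_angles(counts):
    """Return the slice angle in degrees (f / n * 360) for each category."""
    n = sum(counts.values())  # total number of observations
    return {category: f / n * 360 for category, f in counts.items()}


angles = pie_angles(deliveries)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

The angles of all slices should add up to 360 degrees, which is a quick check that no category was left out.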
Histogram: a graphic representation used to present continuous quantitative data arranged in class intervals. It is composed of a number of bars adjacent to each other. The width of each bar is very important and equals the width of the class interval; the length of each bar is proportional to the frequency of the class interval or its percentage. The area of the histogram is therefore very important: the total area represents one unit (100%), which corresponds to the total probability.

[Figure 4: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2024. Histogram of frequency against class interval (8- to 15-15.9), with frequencies 1, 3, 14, 19, 14, 13, 5, 1.]

Line graph (frequency polygon): a graphic representation used to present discrete quantitative data. It can also be derived from a histogram (which presents continuous quantitative data arranged in class intervals) by taking the mid-point at the top of each bar and joining these points by straight lines. The line graph should not be left open. It is closed by taking the mid-point of the class interval before the first class interval (which has a frequency of zero) and the mid-point of the class interval after the last class interval (which also has a frequency of zero), so that the line meets the X-axis at both ends. The area under the line and above the X-axis is equal to the area of the histogram, that is, one unit (100%), which corresponds to the total probability. A line graph is also used to present two groups on one graph for the purpose of comparison, which is not possible with a histogram (a bar of group 1 would cover a bar of group 2).

[Figure: line graph comparing two groups, pre-test and post-test.]

Spot map (spot chart, map chart): a graphic representation used to present data on a map.

Scatter diagram: a graphic representation used in correlation and regression to show the relationship between two quantitative variables.
Cumulative relative frequency percentage curve: a special type of line graph in which the X-axis is the variable and the Y-axis is the C.R.F.%. It is used to find the value of the median precisely. The curve has a sigmoid shape (sigmoid curve).

The characteristics of graphs:
▪ A graph should be simple, easy to understand and self-explanatory.
▪ Each graph should have a number.
▪ Each graph should have a title written at the bottom of the graph. This title should answer the following questions: what, where, when, and who.
▪ Abbreviations and codes should be avoided; if they must be used, they should be explained inside the graph.

Stem-and-leaf display: another graphical method of representing data.

DIFFEREN Stem-and-Leaf Plot
Frequency | Stem & Leaf
5.00 | -7 . 00000
1.00 | -6 . 0
4.00 | -5 . 0000
4.00 | -4 . 0000
1.00 | -3 . 0
4.00 | -2 . 0000
8.00 | -1 . 00000000
.00 | -0 .
6.00 | 0 . 000000
20.00 | 1 . 00000000000000000000
15.00 | 2 . 000000000000000
6.00 | 3 . 000000
4.00 | 4 . 0000
4.00 | 5 . 0000
2.00 | 6 . 00
4.00 | 7 . 0000
2.00 | 8 . 00
2.00 | 9 . 00
1.00 | 10 . 0
7.00 | Extremes (>=12.0)
Stem width: 1.00
Each leaf: 1 case(s)

Lecture Four: Measurement of central location

Data summarisation: Data are summarised by:
▪ Measurements of central tendency (average measurements, measurements of location, measurements of position)
▪ Measurements of variability (dispersion, distribution measurements)
▪ Skewness
▪ Kurtosis

Measures of central tendency

Descriptive measure: a single number used as a means to summarise data.
Statistic: a descriptive measure computed from the data of a sample.
Parameter: a descriptive measure computed from the data of a population.

The measures of central tendency are:
▪ Mean
▪ Mode
▪ Median

Mean: a measure calculated by adding all the values in a population or a sample and dividing by the number of values that are added.
If each value is denoted x and the number of values is n, then the mean is $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x}{n}$

Properties of the mean:
▪ Uniqueness
▪ Simplicity
▪ Since each and every value in a set of data enters into the computation of the mean, it is affected by each value. Extreme values therefore have an influence on the mean.

Median: the value that divides the set into two equal parts after the values have been sorted in ascending or descending order. If the number of values is odd, the median is the middle value when all values have been arranged in order of magnitude. When the number of values is even, there is no single middle value; instead, there are two middle values. In this case, the median is taken to be the average of these two middle values, when all values have been arranged in order of magnitude.

Properties of the median:
1. Uniqueness
2. Simplicity
3. It is not as affected by extreme values as the mean is

Mode: the value that occurs most frequently in a set of observations. A set of values may have more than one mode (e.g. bimodal, trimodal).

Q1) Write columns for relative frequency and cumulative frequency.

Q2) When we studied the ages of cancer patients, we found the following:
Number of cases: N = your seat number + 100
Maximum age: N/2
Minimum age: N/4
Using Sturges' formula, determine the optimal number of class intervals and the width of each class, and draw the table (class interval column only).

Lecture Five: Measurement of variability

Measures of dispersion

Dispersion: the variety that a set of observations exhibits.
Range: the difference between the largest and smallest values in a set of observations.
Variance: a measure of dispersion relative to the scatter of values about their mean: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation: the square root of the variance.
Coefficient of variation: a measure that expresses the standard deviation as a percentage of the mean: $CV\% = \frac{\text{standard deviation}}{\text{mean}} \times 100$
Q/ Why is it used?
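As a rough illustration of the measures of dispersion defined above, the following Python sketch computes them for five haemoglobin readings (8, 9, 10, 11, 12 g/dL, the values used in the worked variance example of this lecture). It relies on Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 denominator; the variable names are illustrative assumptions.

```python
import statistics

# Five haemoglobin readings in g/dL (from the lecture's variance example).
values = [8, 9, 10, 11, 12]

mean = statistics.mean(values)
value_range = max(values) - min(values)   # largest minus smallest value
variance = statistics.variance(values)    # sample variance, divides by n - 1
sd = statistics.stdev(values)             # square root of the sample variance
cv_percent = sd / mean * 100              # SD as a percentage of the mean

print(f"mean = {mean}, range = {value_range}")
print(f"variance = {variance}, SD = {sd:.3f}, CV% = {cv_percent:.2f}")
```

Because the coefficient of variation is a ratio of two quantities in the same unit, it has no unit of its own, which is what makes it usable for comparing groups measured on different scales.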
Measures of variability (dispersion): The degree to which numerical (quantitative) data tend to spread about an average value is called the variation or dispersion of the data. Variation is in the nature of data, i.e. the observations never all take one value. Many measures of variation (dispersion) are available, but the most commonly used are the following.

Range: the difference between the smallest and the largest value in a set of values.
Range (R) = Largest value (Xl) − Smallest value (Xs)
The range is of limited use in statistics as a measure of variability because it takes only two values into consideration and neglects the others. For example, with 10 values the range considers only 2 values and neglects the other 8; with 100 values it neglects the other 98; and with 1000 values it neglects the other 998. The two values considered by the range are the two extremes (the smallest and the largest), which are not of high interest in biostatistics and cannot describe the variation fully.

The uses of the range:
▪ It gives an idea of the extent of the data distribution (the scale over which the data extend or spread).
▪ It is used to determine the width of the class interval in a class interval table (W = R/K).

Variance: the average of the squared deviations of the observations from their mean in a set of observations, or the scatter of values about their mean: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

e.g.: Suppose we have five persons with haemoglobin level measurements (g/dL) of 8, 9, 10, 11 and 12. The mean is 10 g/dL.

Haemoglobin level (g/dL), x | 8 | 9 | 10 | 11 | 12
Deviation, d = x − mean | 8 − 10 = −2 | 9 − 10 = −1 | 10 − 10 = 0 | 11 − 10 = +1 | 12 − 10 = +2
d² | 4 | 1 | 0 | 1 | 4

Variance $s^2 = \frac{\sum d^2}{n - 1} = \frac{10}{5 - 1} = \frac{10}{4} = 2.5$

Standard deviation: The SD is defined as the square root of the variance.
It is a measure widely used in biostatistics as a measure of variability. If the value of the SD is high, the data possess a large variation, and vice versa.

Coefficient of variation (CV%): the standard deviation expressed as a percentage of the mean. It is used in statistics in the following conditions:

1. To compare the variability of two groups for the same variable measured in different units (e.g. birth weight is measured in kilograms in Iraq and in pounds in the UK). We cannot compare the variability of the two groups by SD, but we can compare it by CV%.
e.g.: Birth weight = 3.5 ± 0.5 kg in Iraq and 7.0 ± 0.87 lb in the UK
CV% in Iraq = SD/mean × 100 = 0.5/3.5 × 100 = 14.29%
CV% in UK = SD/mean × 100 = 0.87/7.0 × 100 = 12.43%
So the variability among Iraqi births is greater than in the UK, by a difference of 1.86 percentage points in CV%.

2. To compare the variability of two groups for the same variable measured in the same unit, when they have the same SD value but different means.
e.g.: Birth weight = 3.5 ± 0.5 kg among healthy born infants and 2.5 ± 0.5 kg among congenitally abnormal infants
CV% in healthy = SD/mean × 100 = 0.5/3.5 × 100 = 14.29%
CV% in abnormal = SD/mean × 100 = 0.5/2.5 × 100 = 20%
So the variability among congenitally abnormal infants is greater than among healthy infants, by a difference of 5.71 percentage points in CV%.

e.g.: The plasma volume of 8 healthy adult males: 2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, and 3.12 litres
∑x = 2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12 = 24.02
Mean = ∑x/n = 24.02/8 = 3.002 litres
Rearranging the measurements in increasing order:
1st 2.62, 2nd 2.75, 3rd 2.76, 4th 2.86, 5th 3.05, 6th 3.12, 7th 3.37, 8th 3.49 litres
Median position = (n + 1)/2 = (8 + 1)/2 = 4.5 (between the 4th and 5th values)
Median = the average of the 4th and 5th values = (2.86 + 3.05)/2 = 2.955 litres. This value divides the data into two equal parts.
Mode: No value occurs more often than the others, so there is no mode here.
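The central-tendency figures for the plasma volume example can be checked with a few lines of Python. This is only a sketch using the standard `statistics` and `collections` modules; the variable names are assumptions made for illustration.

```python
import statistics
from collections import Counter

# Plasma volume (litres) of the 8 healthy adult males in the example.
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

mean = statistics.mean(plasma)      # sum of the values divided by n
median = statistics.median(plasma)  # average of the 4th and 5th sorted values

# A mode exists only if some value occurs more often than the others.
highest_count = max(Counter(plasma).values())
has_mode = highest_count > 1

print(f"mean = {mean:.4f} L, median = {median:.3f} L, mode present: {has_mode}")
```

With an even number of observations, `statistics.median` averages the two middle values, which is the same rule the lecture describes.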
Range = Xl − Xs = 3.49 − 2.62 = 0.87 litre
SD = √Variance = √0.097 = 0.311 litre
CV% = SD/mean × 100 = 0.311/3.002 × 100 = 10.36%

Table: The parity distribution of mothers attending ANC clinic in the Al-Muntezeh PHCC for the year 2010
Parity (x) | Frequency (f) | Cum. f | xf | r.f. | c.r.f. | r.f.% | c.r.f.% | x²f
0 | 3 | 3 | 0 | 0.03 | 0.03 | 3% | 3% | 0
1 | 15 | 18 | 15 | 0.15 | 0.18 | 15% | 18% | 15
2 | 24 | 42 | 48 | 0.24 | 0.42 | 24% | 42% | 96
3 | 27 | 69 | 81 | 0.27 | 0.69 | 27% | 69% | 243
4 | 15 | 84 | 60 | 0.15 | 0.84 | 15% | 84% | 240
5 | 10 | 94 | 50 | 0.10 | 0.94 | 10% | 94% | 250
6 | 6 | 100 | 36 | 0.06 | 1.00 | 6% | 100% | 216
Total | n = 100 | - | ∑xf = 290 | 1.00 | - | 100% | - | ∑x²f = 1060

For the calculations:
$\sum xf = (0 \times 3) + (1 \times 15) + (2 \times 24) + (3 \times 27) + (4 \times 15) + (5 \times 10) + (6 \times 6) = 290$
Mean $\bar{X} = \frac{\sum xf}{n} = \frac{290}{100} = 2.9$
Mode = 3 (it has the highest frequency, i.e. 27)
Median position $= \frac{n + 1}{2} = \frac{100 + 1}{2} = \frac{101}{2} = 50.5$ (between the 50th and 51st values)
From the cumulative frequency column, the median = 3.
Or: the median is the 50th percentile (half of 100% = 50%), so from the C.R.F.% column the median = 3.

Table: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
Haemoglobin (g/dL) | Freq. (f) | Mid-point (MP) | MP × f | Cum. f | r.f. | c.r.f. | r.f.% | c.r.f.% | MP² × f
8- | 1 | 8.5 | 8.5 | 1 | 0.014 | 0.014 | 1.4% | 1.4% | 72.25
9- | 3 | 9.5 | 28.5 | 4 | 0.043 | 0.057 | 4.3% | 5.7% | 270.75
10- | 14 | 10.5 | 147.0 | 18 | 0.200 | 0.257 | 20% | 25.7% | 1543.5
11- | 19 | 11.5 | 218.5 | 37 | 0.271 | 0.528 | 27.1% | 52.8% | 2512.75
12- | 14 | 12.5 | 175.0 | 51 | 0.200 | 0.728 | 20% | 72.8% | 2187.5
13- | 13 | 13.5 | 175.5 | 64 | 0.186 | 0.914 | 18.6% | 91.4% | 2369.25
14- | 5 | 14.5 | 72.5 | 69 | 0.071 | 0.985 | 7.1% | 98.5% | 1051.25
15-15.9 | 1 | 15.5 | 15.5 | 70 | 0.014 | 1.00 | 1.4% | 100% | 240.25
Total | n = 70 | - | ∑MPf = 841 (∑x) | - | 1.00 | - | 100% | - | ∑MP²f = 10247.5 (∑x²)

For the calculations:
$\sum x = \sum MPf = (8.5 \times 1) + (9.5 \times 3) + (10.5 \times 14) + (11.5 \times 19) + (12.5 \times 14) + (13.5 \times 13) + (14.5 \times 5) + (15.5 \times 1) = 841$
Mean $\bar{X} = \frac{\sum MPf}{n} = \frac{841}{70} = 12.01$ g/dL
Mode = 11.5 g/dL (the mid-point of the class interval 11-11.9, which has the highest frequency, i.e. 19)
Median position $= \frac{n}{2} = \frac{70}{2} = 35$th value
From the cumulative frequency column,
the median lies in the class interval 11-11.9.
Median $= L + \frac{r}{f} \times W$
L = lower limit of the class interval containing the median = 11
r = remaining number of observations needed to reach the position of the median = (n/2) − the previous cumulative frequency = 70/2 − 18 = 17
f = frequency of the class interval containing the median = 19
W = width of the class interval = 1
Median $= L + \frac{r}{f} \times W = 11 + \frac{17}{19} \times 1 = 11.89$ g/dL
(This is the median formula for grouped data.)
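The grouped-data mean and median above can be reproduced with a short Python sketch. The class limits and frequencies come from the haemoglobin table; the function names `grouped_mean` and `grouped_median` and the (lower limit, frequency) layout are assumptions made for illustration, and the sketch assumes all class intervals share one width.

```python
# (lower limit, frequency) for each class interval of the haemoglobin table.
classes = [(8, 1), (9, 3), (10, 14), (11, 19), (12, 14), (13, 13), (14, 5), (15, 1)]
WIDTH = 1.0  # constant width of every class interval, in g/dL


def grouped_mean(classes, width):
    """Mean of grouped data: sum of (mid-point * frequency) divided by n."""
    n = sum(f for _, f in classes)
    return sum((lower + width / 2) * f for lower, f in classes) / n


def grouped_median(classes, width):
    """Median of grouped data: L + (r / f) * W."""
    n = sum(f for _, f in classes)
    position = n / 2          # position of the median
    cumulative = 0            # cumulative frequency before the current class
    for lower, f in classes:
        if cumulative + f >= position:
            r = position - cumulative  # observations still needed in this class
            return lower + r / f * width
        cumulative += f
    raise ValueError("empty frequency table")


print(f"mean = {grouped_mean(classes, WIDTH):.2f} g/dL")
print(f"median = {grouped_median(classes, WIDTH):.2f} g/dL")
```

Both results are approximations, because grouping replaces each observation by its class mid-point (for the mean) or assumes observations are spread evenly within the median class (for the median).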
