Introduction To Statistics PDF
Jewelle Dawn L. Tuburan
Summary
This document presents an introduction to fundamental concepts in statistics. It covers data types (quantitative and qualitative), the scales of measurement (nominal, ordinal, interval, and ratio), and various data collection methods. It also explores data organization and presentation techniques, including tables, charts, and diagrams.
Full Transcript
INTRODUCTION TO STATISTICS
JEWELLE DAWN L. TUBURAN, RMT

LEARNING OBJECTIVES
At the end of the lesson, the student is able to:
1. Define statistics and biostatistics;
2. Identify the different scales of measurement; and
3. Define and identify the different types of data.

STATISTICS
A branch of mathematics that involves the collection, analysis, interpretation, presentation and organization of data.
The science of making sense of information and data around us.
The cornerstone of epidemiology.

TWO MAJOR DIVISIONS OF STATISTICS
MATHEMATICAL STATISTICS – The study and development of statistical theory and methods in the abstract.
APPLIED STATISTICS – The application of statistical methods to solve real problems involving randomly generated data, and the development of new statistical methodology motivated by real problems.

DATA
TWO MAJOR TYPES OF DATA
QUANTITATIVE DATA
Data that can be measured (quantified) and written down numerically.
QUALITATIVE DATA
Descriptive data that are difficult to measure or count and cannot be written down numerically.

MEASUREMENTS
FOUR LEVELS OR SCALES OF MEASUREMENT

NOMINAL DATA
Neither measurable nor ranked, but simply categorized or classified.
Ex. Address, gender, student’s course, eye color and religion

Nominal scale data: survival status of propranolol-treated and control patients with myocardial infarction

Status 28 days after hospital admission   Propranolol-treated patients   Control patients
Dead                                      7                              17
Alive                                     38                             29
Total                                     45                             46
Survival rate                             84%                            63%

MEASUREMENTS
ORDINAL DATA
Shown simply in order of magnitude, since there is no standard measure of the differences between values.
Examples:
Dichotomous data – “Guilty” or “not guilty”
Non-dichotomous data – Likert scale
1 - strongly agree
2 - agree
3 - no opinion
4 - disagree
5 - strongly disagree

MEASUREMENTS
INTERVAL DATA
Data that belong to a scale on which the differences between values can be quantified in absolute but not relative terms, and for which any zero is merely arbitrary.
This type of scale allows for the degree of difference between items, but not the ratio between them.
Example: Temperature scales such as Celsius and Fahrenheit

RATIO SCALE
A scale of measurement that permits the comparison of both differences and ratios of values. It is a scale with a fixed (true) zero value.
Ex. Distance, Kelvin scale, weight, height

Classification of quantitative data
DISCRETE DATA
A count that cannot be made more precise.
Ex.
1. Number of patients admitted to the hospital, number of patients who visited the OPD
2. Number of bacterial colonies on a plate
CONTINUOUS DATA
Can be divided and reduced to finer and finer levels.
Ex. Blood pressure, serum cholesterol level, height, weight and age
**Counts are discrete while measurements are continuous.

GET 1/4 SHEET OF PAPER!
Identify the type of data (nominal, ordinal, interval or ratio) represented by each of the following. Confirm your answers by giving your own examples.
1. Blood group
2. Temperature (Celsius)
3. Ethnic group
4. Job satisfaction index (1-5)
5. Severity of disease
6. Height
7. Serum uric acid (mg/100 ml)
8. Disease status
9. Number of cases of each reportable disease reported by a health worker
10. Voltage

Data collection
PRIMARY DATA
Collected first hand from the original source.
Data collected specifically for the purpose in mind.
Such data are original in character and are mostly generated by surveys conducted by individuals or research institutions.
FIELD RESEARCHERS – researchers who collect primary data
SECONDARY DATA
The counterpart of primary data.
Data originally collected with another purpose in mind.
Obtained from journals, reports, government publications, and publications of professionals and research organizations.
DESK RESEARCHERS – researchers who collect secondary data

Methods Of Data Collection
Data collection techniques allow us to systematically collect data about our objects of study (people, objects, and phenomena) and about the setting in which they occur.
Various data collection techniques can be used, such as:
Observation
Face-to-face and self-administered interviews
Postal or mail method and telephone interviews
Using available information
Focus group discussions (FGD)

Data organization and presentation
Methods Of Data Organization
The data collected in a survey are called raw data. Collected data need to be organized so as to condense the information they contain in a way that shows patterns of variation clearly. Precise methods of analysis can be decided upon only when the characteristics of the data are understood. To achieve this, different techniques of data organization and presentation, such as the ordered array, tables and diagrams, are used.

FREQUENCY DISTRIBUTIONS
For data to be more easily appreciated and to draw quick comparisons, it is often useful to arrange the data in the form of a table, or in one of a number of different graphical forms.
The presentation of data in a meaningful way is done by preparing a frequency distribution.

Array (ordered array)
A serial arrangement of numerical data in ascending or descending order.
This shows the range over which the items are spread and also gives an idea of their general distribution.

EXAMPLE: In a study, 400 persons were asked how many full-length movies they had seen on television during the preceding week. The following table gives the distribution of the data collected.
NUMBER OF MOVIES   NUMBER OF PERSONS   RELATIVE FREQUENCY (%)
0                  72                  18.0
1                  106                 26.5
2                  153                 38.3
3                  40                  10.0
4                  18                  4.5
5                  7                   1.8
6                  3                   0.8
7                  0                   0.0
8                  1                   0.3
TOTAL              400                 100.0

Number of movies represents the variable under consideration.
Number of persons represents the frequency.
The whole distribution is called a frequency distribution, in this case a simple frequency distribution.

CATEGORICAL DISTRIBUTION
Non-numerical information can also be represented in a frequency distribution.
Seniors of a high school were interviewed on their plans after completing high school. The following data give the plans of 548 seniors of a high school.

SENIOR’S PLAN                                NUMBER OF SENIORS
Plan to attend college                       240
May attend college                           146
Plan to or may attend a vocational school    57
Will not attend any school                   105
TOTAL                                        548

GROUPED FREQUENCY DISTRIBUTION
With large sets of data, a good overall picture and sufficient information can often be conveyed by grouping the data into a number of class intervals.
Example: A social scientist wants to study the ages of persons arrested in a country.

Age (years)    Number of Persons
Under 18       1,748
18-24          3,325
25-34          3,149
35-44          1,323
45-54          512
55 and over    335
Total          10,392

DETERMINATION OF NUMBER OF CLASSES (k)
Sturges’ Formula
K = 1 + 3.322 × log(n), where n is the number of observations and log is the base-10 logarithm

LENGTH OR WIDTH OF CLASS INTERVAL (w)
W = (Maximum value – Minimum value)/K
W = Range/K

DETERMINATION OF CLASS LIMITS
The lower limit of the first class should be chosen so that the frequencies of each class are concentrated near the middle of the class interval.
It is important to note the precision of the measurements: whether they are given to the nearest inch or to the nearest tenth of an inch, to the nearest ounce or to the nearest hundredth of an ounce, and so forth.

DETERMINATION OF CLASS LIMITS
Weight (kg)    Weight (kg)    Weight (kg)
10-14          10.0-14.9      10.00-14.99
15-19          15.0-19.9      15.00-19.99
20-24          20.0-24.9      20.00-24.99
25-29          25.0-29.9      25.00-29.99
30-34          30.0-34.9      30.00-34.99

Ex.
Construct a grouped frequency distribution of the following data on the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week:

23 24 18 14 20 24 24 26 23 21
16 15 19 20 22 14 13 20 19 27
29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
30 17 22 29 29 18 25 20 16 11
17 12 15 24 25 21 22 17 18 15
21 20 23 18 17 16 15 26 23 22
11 16 18 20 23 19 17 15 20 10

Using the above formula, K = 1 + 3.322 × log(80) = 7.32 ≈ 7 classes
Given: Maximum value = 38, Minimum value = 10
Formula: W = Range/K
Range = 38 – 10 = 28
W = 28/7 = 4
Seven classes of width 4 starting at 10 would end at 37 and miss the maximum value of 38, so a more convenient width of 5 is used. With a width of 5, the grouped frequency distribution for the above data is:

Time spent (hours)   Frequency   Cumulative frequency
10-14                8           8
15-19                28          36
20-24                27          63
25-29                12          75
30-34                4           79
35-39                1           80

For a data set of patients, for example:
Given: n = 50
k = 1 + 3.322 × log10(50) = 6.64 ≈ 7
w = R/k = (89 – 1)/7
w = 12.57 ≈ 13

CUMULATIVE AND RELATIVE FREQUENCIES
CUMULATIVE FREQUENCY – obtained when the frequencies of two or more classes are added up. These frequencies help us to find the total number of items whose values are less than or greater than some value.
RELATIVE FREQUENCY – expresses the frequency of each value or class as a percentage of the total frequency.

CONSTRUCTING CUMULATIVE FREQUENCY DISTRIBUTION
“Less than cumulative frequency distribution” – the cumulation starts from the lowest value of the variable and proceeds to the highest. This is the most common cumulative frequency.
“More than cumulative frequency distribution” – the cumulation is from the highest to the lowest value.

Mid-Point of a class interval and the determination of Class Boundaries
Mid-point or class mark (Xc) – the value of the interval which lies mid-way between the lower true limit (LTL) and the upper true limit (UTL) of a class.
Calculated as Xc = (LTL + UTL)/2
True limits (or class boundaries) are determined mathematically to make an interval of a continuous variable continuous in both directions, so that no gap exists between classes.
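The calculations in this part of the lesson can be checked with a short script. The following is a minimal Python sketch, not part of the original lecture: the function names are hypothetical, and the class limits and frequencies are those of the leisure-time table above. It assumes equally spaced classes and takes the true limits to lie half-way across the gap between adjacent stated limits.

```python
import math


def sturges_k(n):
    """Number of classes suggested by Sturges' formula, before rounding."""
    return 1 + 3.322 * math.log10(n)


def class_width(maximum, minimum, k):
    """Class width W = Range / K."""
    return (maximum - minimum) / k


def cumulative(freqs):
    """'Less than' cumulative frequencies, lowest class first."""
    total, out = 0, []
    for f in freqs:
        total += f
        out.append(total)
    return out


def relative(freqs):
    """Relative frequencies as percentages of the total frequency."""
    n = sum(freqs)
    return [100 * f / n for f in freqs]


def boundaries_and_midpoints(limits):
    """True limits (LTL, UTL) and class mark Xc for each stated class.

    Assumption: classes are equally spaced, so half the gap between one
    class's upper limit and the next class's lower limit is applied to
    every stated limit.
    """
    half_gap = (limits[1][0] - limits[0][1]) / 2
    rows = []
    for lower, upper in limits:
        ltl, utl = lower - half_gap, upper + half_gap
        rows.append((ltl, utl, (ltl + utl) / 2))
    return rows


# Leisure-time example: class limits and frequencies as tabulated in the text.
limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [8, 28, 27, 12, 4, 1]

k = sturges_k(80)
print(round(k, 2))                        # 7.32
print(class_width(38, 10, round(k)))      # 4.0
print(cumulative(freqs))                  # [8, 36, 63, 75, 79, 80]
print(relative(freqs))                    # [10.0, 35.0, 33.75, 15.0, 5.0, 1.25]
print(boundaries_and_midpoints(limits)[0])  # (9.5, 14.5, 12.0)
```

The "more than" cumulative frequencies could be obtained the same way by passing the frequencies in reverse order.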
Example: Frequency distribution of weights (in ounces) of malignant tumors removed from the abdomen of 57 subjects

WEIGHT   CLASS BOUNDARIES   XC     FREQ.   CUMULATIVE FREQ.   RELATIVE FREQ.
10-19    9.5-19.5           14.5   5       5                  0.0877
20-29    19.5-29.5          24.5   19      24                 0.3333
30-39    29.5-39.5          34.5   10      34                 0.1754
40-49    39.5-49.5          44.5   13      47                 0.2281
50-59    49.5-59.5          54.5   4       51                 0.0702
60-69    59.5-69.5          64.5   4       55                 0.0702
70-79    69.5-79.5          74.5   2       57                 0.0351
TOTAL                              57                         1.0000

Methods Of Data Presentation
I. STATISTICAL TABLES – an orderly and systematic presentation of numerical data in rows and columns
Rows (stubs) – horizontal arrangements
Columns (captions) – vertical arrangements
The use of tables for organizing data involves grouping the data into mutually exclusive categories of the variables and counting the number of occurrences (frequency) in each category.
Ex. Sex (male, female), marital status (single, married, divorced, widowed), blood group (A, B, AB, O) and method of delivery (normal, forceps, Cesarean section) are some qualitative variables with mutually exclusive categories.

Construction of tables
The following general principles should be addressed in constructing tables:
1. Tables should be as simple as possible.
2. Tables should be self-explanatory. For that purpose:
   The title should be clear and to the point (a good title answers: what? when? where? how classified?) and should be placed above the table.
   Each row and column should be labelled.
   Numerical entries of zero should be explicitly written rather than indicated by a dash. Dashes are reserved for missing or unobserved data.
   Totals should be shown either in the top row and the first column or in the last row and last column.
3. If data are not original, their source should be given in a footnote.

A. The Simple or one-way table
The simple frequency table is used when the individual observations involve only a single variable, whereas the cross-tabulation is used to obtain the frequency distribution of one variable by the subsets of another variable.
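The difference between a simple (one-way) table and a cross-tabulation can be illustrated with a minimal Python sketch. This is an illustration only: the records below are hypothetical and are not data from this lecture, and the helper names `one_way`, `two_way` and `percentages` are assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical records (illustration only, not data from the lecture):
# each tuple is (marital status, immunization status) for one person.
records = [
    ("Single", "Immunized"),
    ("Single", "Non-immunized"),
    ("Married", "Immunized"),
    ("Married", "Immunized"),
    ("Married", "Non-immunized"),
    ("Widowed", "Non-immunized"),
]


def one_way(records, index):
    """Simple frequency table: counts for a single variable."""
    return Counter(rec[index] for rec in records)


def two_way(records):
    """Cross-tabulation: counts for each (row category, column category) pair."""
    return Counter(records)


def percentages(counts):
    """Each count as a percentage of the total, rounded to one decimal place."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


status_table = one_way(records, 1)   # one variable: immunization status
cross_table = two_way(records)       # two variables: marital status by immunization

print(dict(status_table))
print(percentages(status_table))
print(cross_table[("Married", "Immunized")])
```

Because each record falls into exactly one category per variable, the categories are mutually exclusive and the counts in either table add up to the number of records.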
Table 1: Overall immunization status of children in Adami Tullu Woreda, Feb. 1995

IMMUNIZATION STATUS    NUMBER   PERCENT
Not immunized          75       35.7
Partially immunized    57       27.1
Fully immunized        78       37.2
TOTAL                  210      100.0

B. Two-way table
This table shows two characteristics and is formed when either the caption or the stub is divided into two or more parts.

Table 2: TT immunization by marital status of the women of childbearing age

                  IMMUNIZATION STATUS
MARITAL STATUS    Immunized        Non-immunized    Total
                  No.     %        No.     %
Single            58      24.7     177     75.3     235
Married           156     34.7     294     65.3     450
Divorced          10      35.7     18      64.3     28
Widowed           7       50.0     7       50.0     14
TOTAL             231     31.8     496     68.2     727

C. Higher Order Table
Used when it is desired to represent three or more characteristics in a single table.
Example: A study was carried out on the degree of job satisfaction among doctors and nurses in rural and urban areas. To describe the sample, a cross-tabulation was constructed which included the sex and the residence (rural or urban) of the doctors and nurses interviewed.

Table 3: Distribution of Health Professionals by Sex and Residence

                       RESIDENCE
PROFESSION/SEX         Urban   Rural   Total
Doctors   Male         8       35      43
          Female       2       16      18
Nurses    Male         46      36      82
          Female       23      77      100
TOTAL                  79      164     243

Methods Of Data Presentation
II. DIAGRAMMATIC REPRESENTATION OF DATA
The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a graph than from a table.
A diagram is simpler and more easily understood.
Diagrammatic representation consists of presenting statistical material in geometric figures, pictures, maps and lines or curves.
Bar charts and pie charts are commonly used for qualitative or quantitative discrete data.
Histograms and frequency polygons are used for quantitative continuous data.

A. Simple bar chart
It is a one-dimensional diagram in which the bar represents the whole of the magnitude.
The height or length of each bar indicates the size (frequency) of the figure represented.
Fig. 1 Immunization status of children in Adami Tullu Woreda

B.
Multiple bar chart
In this type of chart the component figures are shown as separate bars adjoining each other.
The height of each bar represents the actual value of the component figure.
It depicts the distributional pattern of more than one variable.
Fig. 2 TT immunization status by marital status of women 15-49 years

C. Pie-chart (qualitative or quantitative discrete data)
It is a circle divided into sectors so that the areas of the sectors are proportional to the frequencies.
[Pie chart: Fully immunized 37%, Partially immunized 27%, Not immunized 36%]
Fig. 3 Immunization status of children in Adami Tullu Woreda

D. Histograms (quantitative continuous data)
A histogram is the graph of the frequency distribution of continuous measurement variables.
Example: Consider the data on time (in hours) that 80 college students devoted to leisure activities during a typical school week:
Fig. 4 Histogram for amount of time college students devoted to leisure activities

E. FREQUENCY POLYGON
If we join the midpoints of the tops of the adjacent rectangles of the histogram with line segments, a frequency polygon is obtained.
When the polygon is continued to the X-axis just outside the range of the data, the total area under the polygon will be equal to the total area under the histogram.
Example: Consider the data on time (in hours) that 80 college students devoted to leisure activities during a typical school week:
[Figure: vertical axis, No. of students; horizontal axis, Midpoints of class intervals]
Fig. 5 Frequency polygon curve on time spent for leisure activities by students

F. LINE DIAGRAM
The line graph is especially useful for the study of some variables according to the passage of time.
Time, in weeks, months or years, is marked along the horizontal axis.
The value of the quantity that is being studied is marked on the vertical axis.
The line graph is suitable for depicting a consecutive trend of a series over a long period.
Example: Malaria parasite rates as obtained from malaria seasonal blood survey results.
[Figure: vertical axis, Rate (%); horizontal axis, Year]
Fig. 6 Malaria Parasite Prevalence rates in Ethiopia, 1967-1979

VARIABLES
VARIABLE
Any entity that can take on different values.
Anything that can vary can be considered a variable.

DATA ANALYSIS
TWO MAIN METHODS
DESCRIPTIVE STATISTICS
Summarizes data from a sample.
Most often concerned with two sets of properties of a distribution (sample or population):
Central tendency (or location) seeks to characterize the distribution’s central or typical value.
Dispersion (variability) characterizes the extent to which members of the distribution depart from its center and from each other.
INFERENTIAL STATISTICS
Draws conclusions from data that are subject to random variation.
Inferences are made under the framework of probability theory, which deals with the analysis of random phenomena.

SAMPLING
CENSUS
The study of every unit, everyone or everything, in a population.
Also known as “complete enumeration”, which means a complete count.
SAMPLE
A subset of a population that represents the population.
It implies a smaller size than the population.
Since it is only a sample, it is not one hundred percent accurate.

BIOSTATISTICS
The branch of applied statistics directed toward applications in the health sciences and biology.
It is an innovative field that involves the design, analysis, and interpretation of data for studies in public health and medicine.

HISTORY
18TH CENTURY
Statistical methods were used to resolve therapeutic debates on the practice of smallpox inoculation.
Inoculation involved inserting matter from actual smallpox pustules under an individual’s skin in the hope of creating a mild form of the disease that would induce later immunity.
This contradicted the principle of “Primum non nocere” (First, do no harm) adhered to by all medical professionals.
John Arbuthnot
A London physician who published an anonymous pamphlet in 1722.
He examined the London Bills of Mortality from earlier years and estimated that the chance of dying from naturally occurring smallpox was 1:10.
He also asserted that the chance of dying from inoculation-induced smallpox was 1:100.

HISTORY
JAMES LIND
Credited with designing a controlled clinical trial (intentionally dividing the participants into two or more comparable groups to test hypotheses); regarded as the father of the modern controlled clinical trial.
In 1747, he had to deal with an outbreak of SCURVY.
He selected 12 sailors and divided them into six groups of two.
DANIEL’S FASTING
Legumes and water for 10 days

Conclusion
Ultimate goal in statistics: NOT TO SUMMARIZE THE DATA BUT TO FULLY UNDERSTAND THEIR COMPLEX RELATIONSHIPS

ASSIGNMENT!
The National Institute of Cholera and Enteric Disease (NCED) is a specialized institute of the Indian Council of Medical Research (ICMR) located in Kolkata, West Bengal, India. Cholera is highly endemic in this region of India, so while the NCED works as a reference center for the entire country, the institute is mainly active in West Bengal.
In October 2004, the epidemiologist assigned to “North 24 Parganas” (a district in West Bengal) conducted a routine visit to the NCED and its laboratory. The microbiologist in charge of cholera mentioned to him that during the previous month (September), the average isolation of Vibrio cholerae from 19 stool specimens each month between January and August, 2004

                                                            JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP
Number of stool samples from which V. cholerae was isolated   0    2    5   20   12   10    7   15   65

A. What scale of measurement is presented in Table No. 1?
   a. Are these primary or secondary data?
   b. These data are quantitative. Are they continuous or discrete?
B. How would you present the data other than by tabulating them? Illustrate.
C. RESEARCH ON CHOLERA. How many samples are needed to confirm the diagnosis during an outbreak?