Introduction To Biostatistics PDF
Dr. Pranab Kr. Banerjee
Summary
This book introduces the fundamental principles of biostatistics. It is intended for graduate and postgraduate students of biological sciences and includes information and examples pertaining to biological sciences.
Full Transcript
INTRODUCTION TO BIOSTATISTICS
[A TEXTBOOK OF BIOMETRY]
[For the Graduate & PG Students of Biological Sciences]

Dr. PRANAB Kr. BANERJEE
M.Sc.(C.U.), Ph.D.(C.U.), FZS, FZSEI
Associate Professor
Chairperson, Department of P. G. Studies in Zoology, Serampore College, Serampore, Hooghly
Department of Microbiology, R.K. Mission Vidyamandira, Belur Math, Howrah

S. CHAND & COMPANY LTD. (An ISO 9001 : 2000 Company)
Head Office: 7361, RAM NAGAR, NEW DELHI - 110 055

© 2004, Dr. Pranab Kumar Banerjee. All rights reserved.
No part of this publication may be reproduced or copied in any material form (including photocopying or storing it in any medium in the form of graphics, electronic or mechanical means, and whether or not transient or incidental to some other use of this publication) without written permission of the copyright owner. Any breach of this will entail legal action and prosecution without further notice.

Jurisdiction: All disputes with respect to this publication shall be subject to the jurisdiction of the Courts, tribunals and forums of New Delhi only.

First Edition 2004
Revised Edition 2005
Third Edition 2006
Reprint 2007, 2008, 2009 (Twice)
Revised and Fourth Enlarged Edition 2011

ISBN : 81-219-2329-8
Code : 03A 303

PRINTED IN INDIA
By Rajendra Ravindra Printers Pvt. Ltd., 7361, Ram Nagar, New Delhi - 110 055 and published by S. Chand & Company Ltd., 7361, Ram Nagar, New Delhi - 110 055.

PREFACE TO THE FOURTH REVISED AND ENLARGED EDITION

To begin with, I wholeheartedly thank my beloved students, as well as my colleagues and friends at different colleges and universities throughout India, for the admiration and reception they have shown to my book “Introduction to Biostatistics” (1st to 3rd editions). I hope they will also extend their good wishes and appreciation to this enlarged, revised and elegant 4th edition.

To enhance the utility of this book, it has been thoroughly revised and recast, and numerous examination-oriented solved problems as well as a number of topics, viz. Set theory, Binomial expansion, Permutation, Combination and non-parametric statistics, have been incorporated. Theoretical discussion as well as solutions of problems have been presented in simple, lucid and unambiguous language so as to cater to the needs of students of all streams of Biosciences (Zoology, Botany, Physiology, Microbiology, Biotechnology etc.).
I claim no originality for the matter presented in this book, but the method of presentation and illustration is my own. I have tried my level best to present the entire text in such a manner that it kindles the interest of a student in Biostatistics.

In preparing this book I have taken profuse help from several books. I have expressed my acknowledgment in the bibliography. I am grateful to all those teachers who have helped a lot by giving their valuable suggestions. I convey my respect and pronams to Swami Atmopriyanandji Maharaj (Vice Chancellor, R.K. Mission Vivekananda University, Belurmath, Howrah). I extend my respect and gratitude to my teacher Prof. Rabindra Nath Chatterjee (Genetics Research Unit, Dept. of Zoology, University of Calcutta), Prof. Dhrubojyoti Chatterjee (Pro-Vice Chancellor, Academic, University of Calcutta), Prof. Chandra Sekhar Chakraborty (Vice Chancellor, West Bengal University of Animal and Fishery Sciences) and Dr. Tarit Kumar Banerjee (Associate Professor, Dept. of Zoology, R.P.M. College, Uttarpara). I also extend my thanks to my Departmental Colleagues (both teaching and non-teaching) for their encouragement and constant inspiration.

My sincerest thanks go to the management, DTP and Editorial team of S. Chand & Company Ltd. for their encouragement and neat execution of the revised and enlarged edition of this book in a suitable form.

Last but not least, I extend my sincere thanks to Mrs. Mandira Banerjee (Head Mistress, Baidyabati Charusila Bose Balika Vidyalaya) and Miss Debdatta Banerjee (Daughter) for their endurance and active assistance during the preparation of this book.

Dr. Pranab Kumar Banerjee
Dept. of Zoology
Serampore College

Disclaimer: While the author of this book has made every effort to avoid any mistake or omission and has used his skill, expertise and knowledge to the best of his capacity to provide accurate and updated information, the author and the publisher do not give any representation or warranty with respect to the accuracy or completeness of the contents of this publication and are selling this publication on the condition and understanding that they shall not be made liable in any manner whatsoever. The publisher and the author expressly disclaim all and any liability/responsibility to any person, whether a purchaser or reader of this publication or not, in respect of anything and everything forming part of the contents of this publication. The publisher shall not be responsible for any errors, omissions or damages arising out of the use of the information contained in this publication. Further, the appearance of any personal name, location, place or incident in the illustrations used herein is purely coincidental and a work of imagination. Thus the same should in no manner be termed as defamatory to any individual.

PREFACE TO THE FIRST EDITION

This book, entitled “Introduction to Biostatistics”, is based on my experience of teaching Biostatistics to the students of B.Sc. (Zoology, Botany) courses of Indian universities. Biostatistics is the study of the application of statistical methodology to analyse biological variation, correlation and regression in biological measurement; it is also known as “Biometry”. I claim no originality for the matter presented in the text, but the method of presentation is my own.

This book has been written in a clear, lucid manner to cover the theoretical, practical and applied aspects of statistics. It will help the graduate and postgraduate students of biological sciences (Zoology, Botany & Cytogenetics), Psychology and Education of Indian universities. Research workers and teachers of Biological and Medical Sciences may also get help from this book. I believe the large number of illustrative examples of various types will make this book easy to understand without any external help.
In preparing this book, I have been greatly helped by several books. I express my acknowledgment in the bibliography. I take the opportunity to express my indebtedness to Mr. R. M. Nath, Manager, S. Chand & Co. Ltd. and my student C. Chowdhary, whose encouragement has driven me to prepare this book. The enthusiastic inspiration from Swami Athmaprianandaji Maharaj (Principal, R.K. Mission Vidyamandira, Belur Math), Dr. Lalchun Nunga (Principal, Serampore College), Prof. R.N. Chatterjee (Head of the Department of Zoology, C.U.), Dr. T.K. Banerjee (Reader, Dept. of Zoology, R.P.M. College, Uttarpara) and Mr. R. Das (Dept. of Mathematics, Serampore College) was always a booster for the preparation of this book.

I shall be grateful to readers who bring mistakes and misprints to my notice; these will be removed in the next edition. Comments, criticisms and suggestions from students as well as from teachers and friends for the correction and improvement of this book will be gratefully acknowledged.

My heartfelt thanks to Mr. Ravindra Kr. Gupta, Managing Director, and Mr. Navin Joshi, General Manager of S. Chand & Co. Ltd., for their kind co-operation in the preparation of the book and for bringing it out in time and in a nice form. Last but not least, I sincerely thank Mrs. Mandira Banerjee for her continuous and untiring active assistance during this dedicated work.

Dr. Pranab Kumar Banerjee
Dept. of Zoology
Serampore College

CONTENTS

Chapters
1. Preliminary Concept
2. Frequency Distribution
3. Graphical Representation of Data
4. Central Tendency
5. Measures of Variation
6. Theoretical Distribution
7. Skewness, Kurtosis and Moments
8. Set Theory and Probability
9. Chi-Square Test
10. Student T Distribution
11. Z-test
12. F-Test or Fisher’s F Test
13. Correlation
14. Regression
15. Analysis of variances (ANOVA)
16. Non-parametric Statistics
17. Statistical Tables
18. Notation and Important Formulae
19. Logarithms Tables
Bibliography

CHAPTER 1
PRELIMINARY CONCEPT

INTRODUCTION TO BIOSTATISTICS

Statistics: It refers to the subject of scientific activity which deals with the theories and methods of collection, analysis and interpretation of numerical data.
Biostatistics: This term is used when the tools of statistics are applied to data derived from biological organisms.

Characteristics of Statistics:
1. Statistics are aggregates of facts.
2. Statistics are numerically expressed.
3. Statistics are affected by a multiplicity of causes and not by a single cause.
4. Statistics must be related to some field of inquiry.
5. Statistics should be capable of being related to each other, so that some cause and effect relationship can be established.
6. A reasonable standard of accuracy should be maintained in statistics.

Importance and Usefulness of Statistics:
1. Statistics help in presenting a large quantity of data in a simple and classified form.
2. It gives methods for the comparison of data.
3. It enlarges the individual's mind.
4. It helps in finding the conditions of relationship between variables.
5. It provides material for businessmen as well as administrators, serving as a guide in planning and shaping future policies and programmes.
6. It proves useful in a number of fields, viz. railways, banks, the army, etc.

Limitations of Statistics:
1. Statistical laws hold true only on the average and in the long run.
2. Statistics can be used to analyse only collective matters, not individual events.
3. It is applicable only to quantitative data.
4. Statistical results are ascertained from samples. If the selection of samples is biased, errors will accumulate and the results will not be reliable.
5. The greatest limitation of statistics is that only one who has an expert knowledge of statistical methods can efficiently handle statistical data.

Application and Uses of Biostatistics:
1. In Physiology and Anatomy
(i) To define what is normal or healthy in a population and to find the limits of normality in variables.
(ii) To find the difference between the means and proportions of normal at two places or in different periods.
(iii) To find out the correlation between two variables X and Y, such as height and weight.
2. In Pharmacology
(i) To find out the action of a drug: a drug is given to animals and humans to observe whether the changes produced are due to the drug or to chance.
(ii) To compare the action of different drugs or of two successive dosages of the same drug.
(iii) To find out the relative potency of a new drug with respect to a standard drug.
3. In Medicine
(i) To compare the efficacy of a particular drug. For this, the percentages cured and died in the experimental and control groups are compared.
(ii) To find out an association between two attributes, such as cancer and smoking.
(iii) To identify the signs and symptoms of a disease or syndrome. For example, cough in typhoid is found by chance, whereas fever is found in almost every case.
4. In Community Medicine and Public Health
(i) To test the usefulness of sera and vaccines in the field: the percentage of attacks or deaths among the vaccinated subjects is compared with that among the unvaccinated ones to find whether the difference observed is statistically significant.
(ii) In epidemiological studies, the role of causative factors is statistically tested.
(iii) In public health, the measures adopted are evaluated.

DATA: Data are a collection of observations expressed in numerical figures. The collection may be done in two ways: (a) by complete enumeration and (b) by the sample survey method. The term 'data' is always used in a collective sense and never in the singular.

Types of Data: Statistical data can be divided into two broad categories: (a) Qualitative (b) Quantitative.

Qualitative Data: In this type of data, the observations have no numerical relation with one another.
Example: Skin colour: brown, black, white. Eye colour: blue, brown. Sex: male, female.

Quantitative Data:
1. In this type of data, the observations have a numerical relation with one another.
2. It may be continuous or discrete.
Example: Discrete = number of books, number of students. Continuous = height or weight of a person.
   Qualitative data                          Quantitative data
1. Always discrete.                       1. Discrete or continuous.
2. No magnitude.                          2. Have magnitude.
3. Persons with the same character        3. Arranged by both character and
   are counted to form groups.               frequency.
4. Results are expressed as a ratio       4. Such data are analysed through
   or proportion.                            statistical methods, e.g., mean,
                                             median, mode, S.D. etc.

Data may further be classified as follows.
A. According to source of data collection:
(a) Primary data: Collected directly from the field or experiment.
(b) Secondary data: Obtained from primary data or review.
B. According to variable:
(a) Univariable.
(b) Bivariable.
(c) Multivariable.
C. According to compilation:
(a) Raw data: Data before compilation.
(b) Derived data: Calculated from the primary values of the data.

PRIMARY DATA: These data are collected directly from the field of enquiry for a specific purpose. They are raw data, or data in their original nature, collected directly from the population. Primary data may be collected either by complete enumeration or by sample survey methods.

SECONDARY DATA: These are numerical information which have already been collected by some agency for a specific purpose and are subsequently compiled from that source for application in a different connection. In other words, data used by any agency other than the collecting authority are termed secondary data.

COLLECTION OF PRIMARY DATA: The following methods are generally used for the collection of primary data:
(a) Direct personal observation.
(b) Indirect oral investigation.
(c) Questionnaires sent by mail.
(d) Schedules sent through investigators.

QUESTIONNAIRE: It is a proforma containing a sequence of questions relevant to a statistical enquiry. It is used for the collection of primary data from individual persons through their responses to the set of questions.

RELATIVE ADVANTAGES OF PRIMARY DATA:
1. Primary data provide detailed information, whereas in secondary data some information may be suppressed.
2. Primary data are free from transcribing errors and estimation errors, whereas secondary data may contain such errors.
3. Secondary data normally do not contain information regarding the methods of procuring the data, whereas primary data often include it.
4. Cost effectiveness is a vital plus point for using secondary data.
Thus time, cost, suitability and accuracy are the essential factors in deciding whether to use primary or secondary data.

POPULATION: It is an entire group of people or study elements (persons, things or measurements) having some common fundamental characteristics.
(a) Finite: If a population consists of a fixed number of values, it is said to be finite, e.g., number of days in a week.
(b) Infinite: If a population consists of an endless succession of values, it is said to be infinite, e.g., number of animals in the ocean.

SAMPLING: The technique of obtaining information about the whole group by examining only a part of the whole group is called sampling.

Types of Sampling:
(a) Random sample (Probability sample).
(b) Non-random sample (Non-probability sample).

Objectives of Sampling:
1. To estimate population parameters (mean, SD etc.) from the sample statistics.
2. To test hypotheses about the population from which the sample or samples are drawn.

SAMPLE: It is a relatively small group of a selected number of individuals, objects or cases drawn from a particular population and is used to throw light on the population characteristics.

Fig. 1.1

RANDOM SAMPLE: It is a sample selected in such a way that every element in the population has an equal (unbiased) opportunity of being included in the sample.

CHARACTERISTIC: The term ‘characteristic’ means a quality possessed by an individual (i.e., an object or item of the population). Height, weight, age etc. are characteristics. In statistics, characteristics are of two kinds:
(a) Non-measurable ‘characteristics’ (attributes)
(b) Measurable ‘characteristics’ (variables).
(a) Attributes: Attributes are the non-measurable characteristics, which cannot be numerically expressed in terms of a unit. They are qualitative in nature. For example: religion, nationality, illiteracy etc.
(b) Variables: Variables are the measurable characteristics, which can be numerically expressed in terms of some unit. These are quantities which are capable of being measured directly by quantitative methods. For example: height in inches or cm, weight in kg or pounds, marks in an examination etc.
(i) Discrete Variables (Discontinuous/meristic): These are quantities which can be measured only in whole integral values. They do not take fractional values. Example: number of books, marks in an examination.
(ii) Continuous Variables: These are quantities which can take any value in a specified range. Thus they can take both integral and fractional values. Example: heights, weights etc.

Discrete Variable:
Specimen No.    Number of bristles in Drosophila
1.              3
2.              7
3.              10
4.              8
5.              6

Continuous Variable:
Specimen No.    Body weight of some crabs
1.              5.026 gram
2.              3.732 gram
3.              4.875 gram
4.              3.781 gram
5.              6.023 gram

STATISTICAL ERROR: In statistical terminology the word ‘error’ is used in a special sense. Error shows the extent to which the observed value of a quantity exceeds the true value.
Error = Observed value – True value.

TYPES: Statistical errors may be classified as:
(a) Biased errors, which arise due to personal prejudices or bias of the investigator and informants.
(b) Unbiased errors, which enter into a statistical enquiry due to chance causes.

ARRAY: The presentation of data in ascending order of magnitude is called an array.

TALLY:
(i) A tally mark is an upward slanted stroke (|) which is put against each occurrence of a value.
(ii) When a value occurs more than four times, the fifth occurrence is denoted by a cross (\) tally mark running diagonally across the four tally marks. This facilitates the counting of tally marks at the end.
(iii) The total count of tally marks against each value is called its frequency.
(iv) A frequency distribution with individual values is called a simple frequency distribution.

Example: Form a frequency table for the following values of a variable:
51, 59, 52, 51, 60, 68, 63, 64, 65, 66, 68, 52, 59
60, 58, 51, 54, 55, 56, 61, 62, 69, 70, 58, 69, 65
67, 63, 63, 62, 61, 51, 59, 63, 68, 67, 69, 53, 53
51, 59, 56, 55, 70, 65, 62, 65, 66, 69, 70, 52, 55
64, 65, 69, 61, 63, 54, 64, 61, 61, 62, 51, 52, 52, 54, 55, 52, 52, 66.

Solution:
Values of variable    Tally        Frequency
51                    ||||\ |      6
52                    ||||\ ||     7
53                    ||           2
54                    |||          3
55                    ||||         4
56                    ||           2
58                    ||           2
59                    ||||         4
60                    ||           2
61                    ||||\        5
62                    ||||         4
63                    ||||\        5
64                    |||          3
65                    ||||\        5
66                    |||          3
67                    ||           2
68                    |||          3
69                    ||||\        5
70                    |||          3
Total                              70

CLASSIFICATION: It is the process of arranging the collected statistical information under different categories or classes according to some common characteristics possessed by the individual members.

Types of Classification: There are four types of classification.
(a) On a qualitative basis: Here non-measurable characteristics are classified.
(b) On a quantitative basis: Here measurable characteristics are classified.
(c) On a time basis: Here the statistical data are arranged in order of their time of occurrence.
(d) On a geographical basis: The total population of a country may be classified by states or districts. The basis of classification in such cases is the geographical region.

METHOD OF PRESENTATION OF STATISTICAL DATA: Statistical data are presented in three ways:
(a) Textual Presentation:
(i) Numerical data presented in a descriptive form are called textual presentation.
(ii) It is lengthy. Some words may repeat several times in the text.
(iii) It becomes difficult to grasp the salient points in a textual presentation.
(b) Tabular Presentation:
(i) The logical and systematic presentation of numerical data in rows and columns, designed to simplify the presentation and facilitate comparison, is termed tabulation.
(ii) Tabulation is thus a way of presenting quantitative data in a condensed and concise form so that the numerical figures can be taken in easily and quickly by the eye.
(iii) It is more convenient than textual presentation.
(c) Graphical Presentation: The presentation of quantitative data by graphs and charts is termed graphical presentation.

Tabulation: It may be defined as the logical and systematic presentation of numerical data in rows and columns designed to simplify the presentation and facilitate comparisons.

The advantages of tabulation are:
(i) It enables the significance of data to be readily understood and leaves a more lasting impression than textual presentation.
(ii) It facilitates quick comparison of statistical data shown between rows and columns.
(iii) Errors and omissions can be readily detected when data are tabulated.
(iv) Repetition of explanatory terms and phrases can be avoided, and the concise tabular form clearly reveals the characteristics of the data.

Types of Tabulation: There are two types of tabulation:
(a) Simple tabulation: It contains data in respect of one characteristic only.
(b) Complex tabulation: It contains data on more than one characteristic.

Example (Simple tabulation): Number of students in three colleges.
Name of the college               No. of students
1. Raja Peary Mohan College       2750
2. Serampore College              3400
3. R.K. Mission Vidyamandira      1400

Example (Complex tabulation): Number of students.
Name of the college               B.A. (Hons)   B.Sc. (Hons)   B.A.   B.Sc.   Total
1. R.P.M. College                 450           400            850    1050    2750
2. Serampore College              600           500            800    1500    3400
3. R.K. Mission Vidyamandira      200           200            300    700     1400

Fig. 1.2. Different parts of table (title, box head, caption, column numbers (1) to (7), stub, body, source, foot note).

Statistical Tables: A statistical table is a systematic arrangement of quantitative data under appropriate heads in rows and columns. After the data have been collected, they should be tabulated, that is, put in the form of a table, so that the whole information can be had at a glance.

Parts of a Table:
(I) Title:
(a) This is a brief description of the contents of the table along with the time, place and category of item if required.
(b) The title should be clear and precise.
(c) It should be at the top of the table.
(II) Stub:
(a) The extreme left part of the table, where the descriptions of the rows are shown, is called the stub.
(b) It must be precise and clear.
(III) Caption and Box head:
(a) The upper part of the table, which shows the description of the columns and sub-columns, is called the caption.
(b) The whole of the upper part, including the caption, units of measurement and column numbers if any, is called the box head.
(IV) Body:
(a) It is the main part of the table, excluding the title, stub and captions.
(b) It contains the numerical information, arranged in the table according to the descriptions of the rows and columns given in the stub and caption.
(V) Source and foot note:
(a) It is customary that the source of the data from which the information has been derived should be given at the end of the table.
(b) The foot note is the part below the body where the source of the data and any explanations are shown.

Essential features of a good table:
1. A table must have a title giving a clear and precise idea of the contents of the table.
2. Units of measurement adopted in a table must be shown clearly at the top of the column.
3. The table should be well proportioned in length and breadth.
4. For ease of comparison, columns of relevant figures must be kept as close together as possible.
5. Columns and sub-columns should be clearly distinguished. This can be done by distinct ruling (viz. double ruling, single ruling etc.).
6. Totals of columns may be shown at the bottom of the table. In cases where row totals are useful, they should also be shown.
7. The table must contain the necessary details.
8. The source of information must be disclosed at the end of the table.
9. Any ambiguous or confusing entry in the table should bear a special note at the end of the table for explanation.
10. The arrangement of items in the table should follow a logical sequence.

     Statistic(s)                                 Parameter
(i)   Any statistical measure calculated     (i)   Any statistical measure based on all
      on the basis of sample observations          units in the population is called a
      is called a statistic. Example:              parameter. Example: population mean,
      sample mean, sample S.D.                     population S.D.
(ii)  It characterises samples.              (ii)  It characterises the population.
(iii) It can be directly worked out.         (iii) It is not directly worked out.
(iv)  A statistic is a variate, i.e., the    (iv)  A parameter is a fixed quantity, i.e.,
      value of a statistic varies from             the value of a parameter is constant.
      sample to sample.

Measure                    Statistic(s)   Parameter
Mean                       x̄              μ
Standard deviation         s              σ
Variance                   s²             σ²
Correlation coefficient    r              ρ

Find whether the variable is continuous or discontinuous in the following cases:
(i) Number of individuals in a family.
(ii) Time of flight of a missile.
(iii) Number of gallons of water in a washing machine.
(iv) Life time of television tubes produced in a company.
(B.A. Punjab University, 1967)
Ans: (i) Discrete, (ii) Continuous, (iii) Discrete, (iv) Continuous.

CHAPTER 2
FREQUENCY DISTRIBUTION

Values of Variable and Frequency: The distinct observations are known as the values of the variable. The values of the variable obtained by observation are termed observed values or observations. If a value occurs more than once, the number of times the value is repeated is termed its frequency.
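The definition of frequency can be checked mechanically. The following is a minimal Python sketch that counts how often each value occurs among the 70 observations of the tally example in Chapter 1. The use of `collections.Counter` and the variable names are choices made for this illustration only; they are not part of the book's method.

```python
from collections import Counter

# The 70 observations from the tally example in Chapter 1.
observations = [
    51, 59, 52, 51, 60, 68, 63, 64, 65, 66, 68, 52, 59,
    60, 58, 51, 54, 55, 56, 61, 62, 69, 70, 58, 69, 65,
    67, 63, 63, 62, 61, 51, 59, 63, 68, 67, 69, 53, 53,
    51, 59, 56, 55, 70, 65, 62, 65, 66, 69, 70, 52, 55,
    64, 65, 69, 61, 63, 54, 64, 61, 61, 62, 51, 52, 52,
    54, 55, 52, 52, 66,
]

# Frequency = number of times each distinct value is repeated.
frequency = Counter(observations)

# Print the simple frequency distribution in ascending order of value.
print("Value  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
print(f"Total  {sum(frequency.values()):>9}")
```

The value 57 never occurs in the data, so it does not appear in the output, just as it is absent from the hand-made tally table.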
Frequency Distribution: A frequency distribution is a statistical table which shows the values of a variable arranged in order of magnitude, either individually or in groups, with the corresponding frequencies side by side.

Types of Frequency Distribution: There are two types of frequency distribution:
(a) Simple frequency distribution.
(b) Grouped frequency distribution.
A simple frequency distribution shows the values of the variable individually, whereas a grouped frequency distribution shows the values of the variable in groups or intervals.

Table 2.1 Simple Frequency Distribution
Number of class test    Marks obtained (20)
1                       14
2                       13
3                       15
4                       12
5                       14
6                       12
7                       11
8                       17
Total                   108

Table 2.2 Grouped Frequency Distribution
Age in years    Frequency (No. of persons)
15–19           37
20–24           81
25–29           43
30–34           24
35–44           9
45–59           6
Total           200

Notes to Remember for Forming a Frequency Distribution:
(i) Each class must be clearly defined.
(ii) The classes must be exhaustive, i.e., every raw observation must be included in some class.
(iii) The classes must be exclusive, i.e., non-overlapping.
(iv) Normally the classes should be of equal width.
(v) The number of classes should be neither too large nor too small.

Terms Associated with Grouped Frequency Distribution:
(i) Class or class interval.
(ii) Class limit.
(iii) Class boundaries.
(iv) Class mark.
(v) Class width.
(vi) Class frequency, total frequency, percentage frequency, frequency density.
(vii) Cumulative frequency.

Class Interval or Class: When a large number of observations varying over a wide range are available, they are classified into several groups according to the size of the values. Each of these groups, defined by an interval, is called a class interval or class.

Class intervals are of two types:
(a) Continuous Class Interval: A class interval which does not contain the upper boundary of the class is called a continuous class interval.
A class interval of the form 10–20 in a continuous class contains values from 10 to less than 20. An example is of the form:

Class    Range
0–10     From 0 to less than 10
10–20    From 10 to less than 20
20–30    From 20 to less than 30
30–40    From 30 to less than 40

(b) Discontinuous Class Interval: A class interval where each class includes both end values is called a discontinuous class interval. A class interval of the form 0–9 in a discontinuous class contains values from 0 to 9, both inclusive. An example is of the form:

Class    Range
0–9      From 0 to 9
10–19    From 10 to 19
20–29    From 20 to 29

Continuous class intervals are generally formed with continuous or non-integral values, e.g. rupees, kg etc. Discontinuous class intervals are generally formed with discrete or integral values, e.g. marks.

Open End Class: When one end of a class is not specified, the class is called an open end class. A frequency distribution may have either one or two open end classes.

Income (Rs.)   Frequency
0–50           90
50–100         150
100–150        100
150–200        80
200–250        10

Class Limits: In the construction of a grouped frequency distribution, each class interval must be defined by a pair of numbers such that the upper end of one class does not coincide with the lower end of the class immediately following it. The two numbers used to specify the limits of a class interval, for the purpose of tallying the original observations into the various classes, are called class limits.
(i) The smaller of the pair is known as the lower class limit.
(ii) The larger of the pair is known as the upper class limit.

Class Boundaries: In most measurements of continuous variables, all data are recorded to the nearest unit or integer value. The most extreme values which would ever be included in a class interval are called class boundaries. In fact they are the actual or real limits of a class interval.
(i) The lower extreme point is called the lower class boundary.
(ii) The upper extreme point is called the upper class boundary.

Calculation: If $d$ is the gap between the upper class limit of any class interval and the lower class limit of the next class interval, then

$\text{Lower class boundary} = \text{lower class limit} - \tfrac{1}{2}d$
$\text{Upper class boundary} = \text{upper class limit} + \tfrac{1}{2}d$

Class limits are used only for constructing the grouped frequency distribution. In all statistical calculations and diagrams involving the end points of classes (e.g. median, mode, histogram and ogive), class boundaries are used.

Class Mark (or mid-value or mid-point):
(i) It is the value exactly at the middle of a class interval.
(ii) It lies halfway between the class limits or between the class boundaries.

$\text{Class mark} = \dfrac{\text{Lower class limit} + \text{Upper class limit}}{2}$

(iii) It is used as the representative value of the class interval in the calculation of the mean, standard deviation, mean deviation etc.

Class Width: It is the range or length of a class interval, i.e. the difference between the upper and lower class boundaries.

$\text{Width of class} = \text{Upper class boundary} - \text{Lower class boundary}$

Class Frequency and Total Frequency: The number of observations falling within a class is called its class frequency or simple frequency. The sum of all the class frequencies is called the total frequency.

Relative Frequency: It is the ratio of the frequency of the class to the total frequency.

$\text{Relative frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}}$

(i) It is not expressed as a percentage.
(ii) Relative frequencies are used to compare two or more frequency distributions, or two or more items in the same frequency distribution.

Percentage Frequency: The percentage frequency of a class is the frequency of the class expressed as a percentage of the total frequency.

$\text{Percentage frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}} \times 100$

Table 2.3 Class limits, class boundaries, class marks, class width, frequency density and relative frequency

Class      Class       Class limits    Class boundaries   Class   Class   Frequency   Relative
interval   frequency   Lower   Upper   Lower    Upper     mark    width   density     frequency
(1)        (2)         (3)     (4)     (5)      (6)       (7)     (8)     (9)         (10)
15–19      18          15      19      14.5     19.5      17      5       3.6         0.18
20–24      34          20      24      19.5     24.5      22      5       6.8         0.34
25–29      21          25      29      24.5     29.5      27      5       4.2         0.21
30–34      12          30      34      29.5     34.5      32      5       2.4         0.12
35–39      9           35      39      34.5     39.5      37      5       1.8         0.09
40–44      6           40      44      39.5     44.5      42      5       1.2         0.06
Total      100         -       -       -        -         -       -       -           1.00

Frequency Density: The frequency density of a class interval is its frequency per unit width. It shows the concentration of frequency in a class.

(i) $\text{Frequency density} = \dfrac{\text{Class frequency}}{\text{Width of the class}}$

(ii) It is used in drawing a histogram when the classes are of unequal width.

Cumulative Frequency Distribution: The cumulative frequency corresponding to a class is the sum of all the frequencies up to and including that class.
I. It is obtained by adding the frequency of that class to the frequencies of all the previous classes.
II. Cumulative frequencies are of two types:
(a) Less than Cumulative Frequency: The number of observations 'up to' a given value is called the less than cumulative frequency.
(b) More than Cumulative Frequency: The number of observations 'greater than' a given value is called the more than cumulative frequency.

Table 2.4 Cumulative frequency

Class interval   Frequency   Less than   More than
30–40            8           8           100
40–50            12          20          92
50–60            20          40          80
60–70            25          65          60
70–80            18          83          35
80–90            17          100         17
Total            100

Uses:
1. To find the number of observations less than or more than any given value.
2. To find the number of observations falling between any two specified values of the variable.
3. To find the median, quartiles and percentiles.
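The two cumulative columns of Table 2.4, and the second use listed above, can be checked with a short Python sketch. The function names are assumptions made for this illustration, and `count_between` only accepts values that coincide with class end points.

```python
from itertools import accumulate

# Class end points and class frequencies taken from Table 2.4.
boundaries = [30, 40, 50, 60, 70, 80, 90]
frequencies = [8, 12, 20, 25, 18, 17]


def less_than_cumulative(freqs):
    """Running total of frequencies up to and including each class."""
    return list(accumulate(freqs))


def more_than_cumulative(freqs):
    """Running total taken from the last class backwards."""
    return list(accumulate(reversed(freqs)))[::-1]


def count_between(bounds, freqs, low, high):
    """Number of observations between two class end points (low < high)."""
    cumulative_at_bound = [0] + less_than_cumulative(freqs)
    return cumulative_at_bound[bounds.index(high)] - cumulative_at_bound[bounds.index(low)]


print(less_than_cumulative(frequencies))
print(more_than_cumulative(frequencies))
print(count_between(boundaries, frequencies, 50, 80))
```

The first two printed lists correspond to the "Less than" and "More than" columns of Table 2.4; the last line counts the observations in the classes 50–60, 60–70 and 70–80.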
Problem: Form a frequency table (class limits, class boundaries, class width, frequency density, relative frequency) from the following data:

Marks in Biostatistics   Number of students
Under 30                 0
Under 35                 10
Under 40                 15
Under 45                 8
Under 50                 7

Solution: The classes are 30–35, 35–40, 40–45 and 45–50. Since the upper limit of each class coincides with the lower limit of the next, the gap $d$ is 0 and the class boundaries coincide with the class limits.

Class      Class       Class limits    Class boundaries   Class   Class   Frequency   Relative
interval   frequency   Lower   Upper   Lower    Upper     mark    width   density     frequency
(1)        (2)         (3)     (4)     (5)      (6)       (7)     (8)     (9)         (10)
30–35      10          30      35      30       35        32.5    5       2.0         0.25
35–40      15          35      40      35       40        37.5    5       3.0         0.375
40–45      8           40      45      40       45        42.5    5       1.6         0.2
45–50      7           45      50      45       50        47.5    5       1.4         0.175
Total      40

CHAPTER 3
GRAPHICAL REPRESENTATION OF DATA

Graphical Representation of Data: The representation of quantitative data through suitable charts and diagrams is known as graphical representation of data (statistical information). Graphs include both charts and diagrams. The main object of diagrammatic representation is to emphasise the relative position of the different subdivisions, not simply to record details.

Advantages of Graphical Representation:
(i) It is easily understood by all.
(ii) The data can be presented in a more attractive form.
(iii) It shows the trend and tendency of the values of the variable.
(iv) Diagrammatic representations are useful for detecting mistakes at the time of data computation.
(v) It shows the relationship between two or more sets of figures.
(vi) It has universal applicability.
(vii) It helps in assimilating the data readily and quickly.

Disadvantages of Graphical Representation:
(i) It does not show details or all the facts.
(ii) Graphical representation can reveal only the approximate position.
(iii) It takes a lot of time to prepare a graph.

The Different Types of Diagrammatic Representation: There are various types of graphs in the form of charts and diagrams. Some of them are:
1. Line Diagram or Graph.
2. Bar Diagram.
3. Pie Chart.
4. Histogram.
5. Frequency polygon.
6. Ogives (cumulative frequency polygon).
Modes of Graphical Representation of Data: Data in the form of raw scores are known as ungrouped data; when they are organized into a frequency distribution they are referred to as grouped data. Separate modes and methods are used to represent these two types of data.

A. Graphical Representation of Ungrouped Data: For ungrouped data, the following graphical presentations are used.
1. Line diagrams or graphs.
2. Bar diagram.
3. Pie diagram or chart.
4. Pictograms.

1. Line Diagram or Graph (Historigram): It is the most common method of representing statistical information, mainly used in business and commerce.
(i) It is drawn on paper by plotting the data concerning one variable on the horizontal x-axis (abscissa) and the other variable on the y-axis (ordinate); the two axes intersect at a point called the origin.
(ii) With the help of such graphs, the effect of one variable upon another during an experimental study may be clearly demonstrated.
(iii) Each pair of corresponding X, Y values gives a point on the graph paper. The points thus generated are then joined successively by straight line segments. The figure thus formed is called a line diagram or graph.
(iv) Two types of line diagram are used: (a) natural scale (b) ratio scale.

Example: The data on the effect of practice on learning are given in the table.

Trial No.   1   2   3   4   5    6    7    8    9    10   11   12
Score       4   5   8   8   10   13   12   12   14   16   16   16

Draw a line graph for the representation and interpretation of the above data.

Solution: Plot the points (1, 4), (2, 5), (3, 8), (4, 8), (5, 10), (6, 13), (7, 12), (8, 12), (9, 14), (10, 16), (11, 16) and (12, 16), and join them successively by straight lines.

2. Bar Diagram: A bar diagram is a graph on which the data are represented in the form of bars. It is useful in comparing qualitative or quantitative data of discrete type.
[Fig. 3.1 Line Graph: The effect of practice on learning (x-axis: Trials; y-axis: Scores)]

(i) It consists of a number of equally spaced rectangular areas of equal width, originating from a horizontal base line (X-axis).
(ii) The length of each bar is proportional to the value it represents. The bars should be neither too short nor too long.
(iii) They are shaded or coloured suitably.
(iv) The bars may be vertical or horizontal. If the bars are placed horizontally, it is called a horizontal bar diagram; when the bars are placed vertically, it is called a vertical bar diagram.

There are three types of bar diagram:
(i) Simple bar diagram
(ii) Multiple or grouped bar diagram
(iii) Component or subdivided bar diagram.

(i) Simple Bar Diagram:
(i) It consists of a number of equally spaced vertical bars of uniform width, originating from a horizontal axis, and is shaded.
(ii) The bars are usually arranged according to their relative magnitude.
(iii) The length of each bar is determined by the value or amount of the variable.
(iv) The limitation of the simple bar diagram is that only one variable can be represented on it.

[Fig. 3.2 Simple Bar Diagram showing weight in kg among Boys & Girls]

Example: The heart beat rates of four mammals are given below. Represent them with the help of a bar diagram.

Mammals   Heart beat (per minute)
Whale     75
Horse     45
Pig       70
Cow       50

[Fig. 3.3 Bar diagram showing heart beat (per minute) of mammals]

(ii) Multiple or Grouped Bar Diagram:
(i) A multiple bar diagram represents more than one type of data at a time.
(ii) In this case the numerical values of the major categories are arranged in ascending or descending order so that the categories can be readily distinguished.
(iii) Different shades or colours are used for each category.
(iv) It is to be remembered that the gap between the variables must be the same.

[Fig. 3.4 Multiple or Grouped Bar Diagram showing the number of Male & Female students in the 1st and 2nd year of B.Sc.]

(iii) Component or Subdivided Bar Diagram:
(i) Each bar in a component bar diagram is subdivided into several component parts.
(ii) A single bar represents the aggregate value, whereas the component parts represent the component values of the aggregate.
(iii) It shows the relationship among the different parts and also between the different parts and the main bar.
(iv) Different shades or colours are used to distinguish the various components.

Example: Students of Biological Sciences of Serampore College during 2007 & 2008.

Year   Zoology (Hons)   Botany (Hons)   Physiology (Hons)   General
2007   16               22              20                  52
2008   22               25              18                  55

[Fig. 3.5 Component Bar diagram]

3. Pie Diagram or Chart: It is a circular graph whose area is subdivided into sectors by radii in such a way that the areas of the sectors are proportional to the angles at the centre.
(i) The area of the circle represents the total value and the different sectors of the circle represent the different parts.
(ii) It is generally used for comparing the relation between the various components of a value and between the components and the total value.
(iii) It gives the comparative difference at a glance.
(iv) In a pie chart or diagram, the data are expressed as percentages. Each component is expressed as a percentage of the total value.
(v) The name pie diagram is given to a circle diagram because in determining the circumference of a circle we have to take into consideration a quantity known as 'pi' ($\pi$).

Working Procedure:
(i) A full circle subtends an angle of $2\pi$ radians or 360° (degrees) at its centre. The data to be represented through a circle diagram may therefore be presented through 360°.
(ii) Draw a circle of an appropriate size with pencil and compass. The angle at the centre of a circle totals 360°.
(iii) Convert the given values of the components of an item into percentages of the total value of the item.
(iv) In the pie chart the largest sector remains at the top and the others follow in sequence, running clockwise.
(v) Transpose the various component values into the corresponding degrees on the circle. Since 100% is represented by the 360° angle at the centre of the circle, 1% is represented by 360°/100 = 3.6°. If a certain component is 5%, the angle which represents it is (3.6 × 5) degrees = 18°.
(vi) Measure with a protractor the points on the circle representing the size of each sector. Label each sector for identification.

$\text{Angle} = \dfrac{\text{Value of one component}}{\text{Total of all the components}} \times 360^\circ$

Example: Marks obtained in a test examination by a Bioscience student of Serampore College.

Zoology   Botany   Physiology
65        60       55

Solution: Total of all the components = 65 + 60 + 55 = 180.

Subject      Marks obtained   Angle
Zoology      65               $\frac{65}{180} \times 360^\circ = 130^\circ$
Botany       60               $\frac{60}{180} \times 360^\circ = 120^\circ$
Physiology   55               $\frac{55}{180} \times 360^\circ = 110^\circ$

[Fig. 3.6 Pie chart showing marks of three subjects]

4. Pictogram: It is a popular method of representing statistical data in pictures.
(i) In a pictogram a number of pictures of equal size and definite numerical value are drawn.
(ii) Each picture represents a number of units.
(iii) Pictures are drawn side by side, horizontally or vertically.
(iv) It is widely used in the public and private sectors.

Problem: Represent the following data of the production of books (Biostatistics) from S. Chand & Company.

Year                  2006   2007   2008
Production of books   1000   2000   3000

Solution: The given data are represented by a pictogram as shown in Fig. 3.7.

[Fig. 3.7 Pictogram (one picture = 1000 books; rows for 2006, 2007 and 2008)]

B. Graphical Representation of Grouped Data: For grouped data, the following graphical presentations are used:
1. Histogram
2. Frequency polygon
3.
Cumulative frequency curve (or ogive).

1. Histogram: It is the most common form of diagrammatic representation of a grouped frequency distribution, of both continuous and discontinuous type.
(i) It consists of a set of rectangles drawn on a horizontal base line, i.e. the x-axis (abscissa); the frequency (i.e. number of observations) is marked on the vertical line, i.e. the y-axis (ordinate).
(ii) The width of each rectangle extends over the class boundaries of the corresponding class along the horizontal axis.
(iii) The area of each rectangle is proportional to the frequency in the respective class interval.

$\text{Area of each rectangle} = \text{width} \times \text{height} = \text{width of class} \times \text{frequency density} = \text{width of class} \times \dfrac{\text{class frequency}}{\text{width of class}} = \text{class frequency}$

Working Procedure:
(i) Convert the inclusive series into an exclusive series.

Inclusive series:
Age         10–19   20–29   30–39   40–49
Frequency   2       7       5       8

Exclusive series:
Age (class interval)        Size of interval   Frequency   Frequency density
Score limit   True limit
10–19         9.5–19.5      10                 2           2/10 = 0.2
20–29         19.5–29.5     10                 7           7/10 = 0.7
30–39         29.5–39.5     10                 5           5/10 = 0.5
40–49         39.5–49.5     10                 8           8/10 = 0.8

(ii) The true limits (class boundaries) 9.5–19.5, 19.5–29.5, 29.5–39.5 etc. are used in the construction of the histogram, rather than the class limits 10, 19, 20, 29 etc.
(iii) It is customary to take two extra class intervals, one below and the other above the given grouped intervals.
(iv) Now take the actual lower limits of the class intervals (including the extra intervals) and plot them on the x-axis. The lower limit of the lowest interval (one of the extra intervals) is taken at the intersecting point of the x- and y-axes.
(v) Each class or interval with its specific frequency is represented by a separate rectangle. The base of each rectangle is the width of the class interval and the height is the respective frequency of that class or interval.

[Fig. 3.8 Histogram with frequency distribution (x-axis: class boundaries 9.5 to 49.5; y-axis: frequency)]
(vi) Frequencies are plotted on the y-axis.
(vii) Selection of appropriate units of representation along the x- and y-axes is essential. Neither the x-axis nor the y-axis should be too short.

Types: There are two types of histograms:
(a) Histogram with equal class intervals.
(b) Histogram with unequal class intervals.

(a) Histogram with Equal Class Intervals:
(i) Here the class intervals are drawn on the x-axis with equal width and their respective frequencies on the y-axis.
(ii) Each class and its frequency taken together form a rectangle. The graph of rectangles is known as a histogram.

Example: The population of carp fishes in 100 ponds is as follows:

Number of carps per pond   0–100   100–200   200–300   300–400   400–500   500–600
No. of ponds               12      18        27        20        17        6

Solution: This is the case of a histogram with equal class intervals.

[Fig. 3.9 Histogram with equal class intervals, showing carp population in different ponds (x-axis: number of carps per pond, 0 to 600; y-axis: frequency, no. of ponds)]

(b) Histogram with Unequal Class Intervals:
(i) Here the class intervals are drawn on the x-axis with unequal width and their respective frequencies on the y-axis.
(ii) Some classes have less width and some have more. So the histogram is drawn on the basis of frequency density, not on the basis of frequency.

Example: Draw the histogram of the following frequency distribution with unequal class intervals:

Age group       14–15   16–17   18–20   21–24   25–29   30–34   35–39
No. of people   6       14      12      8       11      10      9

Calculation of Histogram

Age group          Class-boundary   Class-width   Frequency         Frequency density
(class-interval)                                  (No. of people)
14–15              13.5–15.5        2             6                 6/2 = 3
16–17              15.5–17.5        2             14                14/2 = 7
18–20              17.5–20.5        3             12                12/3 = 4
21–24              20.5–24.5        4             8                 8/4 = 2
25–29              24.5–29.5        5             11                11/5 = 2.2
30–34              29.5–34.5        5             10                10/5 = 2
35–39              34.5–39.5        5             9                 9/5 = 1.8

[Fig. 3.10 Histogram with unequal class boundary (x-axis: class-boundary, age in years, 13.5 to 39.5; y-axis: frequency)]

Uses:
(i) It gives a visual representation of the relative size of the various groups.
(ii) The surface of the tops of the rectangles also gives an idea of the nature of the frequency curve for the population.
(iii) The histogram may be used to find the mode graphically.

2. Frequency Polygon: It is an area diagram in the form of a curve, obtained by joining the middle points of the tops of the rectangles in a histogram, or by joining with straight lines the mid-points of the class intervals at the heights of their frequencies. It gives a polygon, i.e. a figure with many angles.
(i) The frequency polygon is obtained by joining the successive points whose abscissae represent the mid-values and whose ordinates represent the corresponding class frequencies.
(ii) In order to complete the polygon, the two end points are joined to the base line at the mid-points of the empty classes at each end of the frequency distribution.
(iii) Thus the frequency polygon has the same area as the histogram, provided the width of all classes is the same.

Example: Construct a histogram and frequency polygon for the following data:

100–150   150–200   200–250   250–300   300–350
4         6         13        5         2

Solution: We have the case of equal class intervals.

Class Interval   Frequency   C.F.
100–150          4           4
150–200          6           10
200–250          13          23
250–300          5           28
300–350          2           30

[Fig. 3.11 Frequency Polygon (x-axis: class intervals, 50 to 400; y-axis: frequency)]

Use:
(i) It is particularly useful in representing a simple and ungrouped frequency distribution of discrete variables.
(ii) It gives an approximate idea of the shape of the frequency curve.

3.
Cumulative Frequency Polygon (ogive): When the cumulative frequencies of a cumulative frequency distribution are plotted against the corresponding class boundaries and the successive points are joined by straight lines, the diagram or curve obtained is known as an ogive or cumulative frequency polygon.

Working Procedure:
(i) The upper limits of the classes are represented along the x-axis.
(ii) The cumulative frequency of a particular class is taken along the y-axis.
(iii) The points corresponding to the cumulative frequency at each upper limit of the classes are joined by a freehand curve. This curve is called a cumulative frequency curve or an ogive.
(iv) In the case of a frequency polygon, the frequency is plotted at the mid-point of the class, but in the case of an ogive the cumulative frequency is plotted at the upper limit of the class.

Example: Draw cumulative frequency polygons or ogives (both 'less than' & 'more than' types) for the following frequency distribution of diastolic blood pressure (mm Hg).

Class Intervals   50–59   60–69   70–79   80–89   90–99   100–109   110–119
No. of patients   8       10      16      14      10      5         2

Solution: Calculation for drawing ogives:

Class boundary   Cumulative frequency
                 Less than   More than
49.5             0           65
59.5             8           57
69.5             18          47
79.5             34          31
89.5             48          17
99.5             58          7
109.5            63          2
119.5            65          0

[Fig. 3.12 Ogives for diastolic blood pressure (x-axis: B.P. in mm Hg, class boundaries 49.5 to 119.5; y-axis: cumulative frequency; curves labelled 'Less-than' and 'More-than', with the median marked)]

Type:
(a) More than Ogive:
(i) It starts from the highest class boundary on the horizontal axis and gradually rises to the total frequency at the lowest class boundary.
(ii) It looks like an elongated letter 'S' turned upside down.
(b) Less than Ogive:
(i) It starts from the lowest class boundary on the horizontal axis and gradually rises to the total frequency at the highest class boundary.
(ii) It looks like an elongated letter 'S'.
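The calculation table for the blood pressure ogives can be reproduced with the Python sketch below. It is an illustration under stated assumptions: the function names are introduced here, the class boundaries are obtained with the half-gap rule from the class limits, and `read_off` assumes the plotted points are joined by straight lines, so the value it returns is only the graphical reading from the 'less than' ogive.

```python
# Class limits and frequencies from the diastolic blood pressure example.
limits = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109), (110, 119)]
frequencies = [8, 10, 16, 14, 10, 5, 2]


def ogive_points(class_limits, freqs):
    """Return class boundaries with 'less than' and 'more than' cumulative frequencies."""
    gap = class_limits[1][0] - class_limits[0][1]      # gap between adjacent classes
    boundaries = [class_limits[0][0] - gap / 2]
    boundaries += [upper + gap / 2 for _, upper in class_limits]
    less_than = [0]
    for f in freqs:
        less_than.append(less_than[-1] + f)
    total = less_than[-1]
    more_than = [total - c for c in less_than]
    return boundaries, less_than, more_than


def read_off(boundaries, less_than, target):
    """Value at which the straight-line 'less than' ogive reaches `target`."""
    for i in range(1, len(boundaries)):
        if less_than[i] >= target and less_than[i] > less_than[i - 1]:
            fraction = (target - less_than[i - 1]) / (less_than[i] - less_than[i - 1])
            return boundaries[i - 1] + fraction * (boundaries[i] - boundaries[i - 1])
    raise ValueError("target exceeds the total frequency")


boundaries, less_than, more_than = ogive_points(limits, frequencies)
median_reading = read_off(boundaries, less_than, sum(frequencies) / 2)
print(boundaries, less_than, more_than, median_reading)
```

Plotting `less_than` and `more_than` against `boundaries` gives the two ogives; `median_reading` is the x-value at which the 'less than' ogive reaches half the total frequency.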
Uses:
(i) It is used to find the median, quartiles, deciles and percentiles of the variable.
(ii) It is also useful in finding the cumulative frequency corresponding to a given value of the variable.
(iii) It is used to find the number of observations which are expected to lie between two given values.

Comparison between histogram and frequency polygon.

Histogram:
1. It is essentially the bar graph of the given frequency distribution.
2. It does not provide a better conception of the contour of the distribution.
3. It is poorly useful.
4. The histogram gives a very clear as well as accurate picture of the relative proportions of frequency from interval to interval.

Frequency Polygon:
1. The frequency polygon is a line graph of the frequency distribution.
2. It gives a much better conception of the contour of the distribution.
3. It is more useful and practicable.
4. In the frequency polygon, it is assumed that the frequencies are concentrated at the mid-points of the class intervals. It merely points out the graphical relationship between the mid-points and the frequencies.

Difference between Diagram and Graph

Diagram:
1. Ordinary paper can be used.
2. It is not helpful for interpolation and extrapolation techniques.
3. The values of the median and mode cannot be estimated.
4. Data are represented by bars, rectangles etc.
5. It is used for comparison only.
6. Diagrams are attractive and are used for publicity.

Graph:
1. Graph paper must be used.
2. It is helpful for interpolation and extrapolation techniques.
3. The values of the median and mode can be estimated.
4. Data are represented by points or lines of different kinds (dots, dashes etc.).
5. It is used for establishing a mathematical relationship between two variables.
6. It is useful to researchers and statisticians in data analysis.

Meta-Analysis
1. The combining of data from different research studies to gain a better overview of a topic than was available in any single investigation.
2.
The data obtained from the combined studies must be comparable in order to be evaluated by this method.

BOX PLOT

Box Plot: A box plot (box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries: smallest observation (minimum), lower quartile (Q1), median (Q2), upper quartile (Q3) and largest observation (maximum), with outliers plotted individually.

Characteristics:
I. A central box spans the quartiles.
II. A line in the box marks the median.
III. Observations more than 1.5 × IQR (interquartile range) outside the central box are plotted individually as possible outliers.
IV. Lines called whiskers extend from the box to the smallest and the largest observations that are not outliers.
V. Box plots can be drawn either horizontally or vertically.
VI. For a symmetric distribution the box plot is symmetric, but a symmetric box plot does not imply a symmetric distribution.
VII. The upper part of the box (the difference between Q3 and M) is bigger than the lower part (the difference between M and Q1) in the case of a positively skewed distribution, and smaller in the case of a negatively skewed distribution.

[Fig. 3.13 Box plot (vertical scale 0 to 140; Q1, M and Q3 marked on the box; outliers plotted individually as *)]

Uses:
I. Box plots are most useful for comparing distributions.
II. A box plot is a quick way of examining one or more sets of data graphically.
III. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and identify outliers.

Interquartile range (IQR): It is a measure of spread based on quartiles. The range or extent from the first quartile (Q1) to the third quartile (Q3) is called the interquartile range.

$\text{IQR} = Q_3 - Q_1$

Semi-interquartile range or quartile deviation: Half of the difference between the third quartile (Q3) and the first quartile (Q1) is called the semi-interquartile range or quartile deviation, i.e. $(Q_3 - Q_1)/2$.
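The five-number summary, the IQR, the quartile deviation and the 1.5 × IQR rule of characteristic III can be sketched in Python as follows. The data are hypothetical and the function name is an assumption made for this illustration. Quartiles are taken from the standard library's `statistics.quantiles` with its default ("exclusive") convention; other textbooks and packages use slightly different quartile conventions and may give slightly different values.

```python
import statistics

# Hypothetical observations (illustrative only).
observations = [2, 4, 5, 6, 7, 8, 30]


def box_plot_summary(values):
    """Five-number summary, IQR, quartile deviation and possible outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)   # default 'exclusive' method
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr                       # 1.5 x IQR below the box
    upper_limit = q3 + 1.5 * iqr                       # 1.5 x IQR above the box
    return {
        "minimum": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": data[-1],
        "iqr": iqr,
        "quartile_deviation": iqr / 2,
        "possible_outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }


summary = box_plot_summary(observations)
print(summary)
```

In a box plot of these data the observation flagged in `possible_outliers` would be plotted individually, and the upper whisker would stop at the largest remaining observation.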
Outlier: An outlier is defined as an observation which falls more than 1.5 × IQR (called a step) above Q3 or below Q1.
I. If an observation falls more than 3 × IQR above Q3 or below Q1, it is known as an extreme outlier.
II. An observation falling between 1.5 × IQR and 3 × IQR above Q3 or below Q1 is known as a suspect outlier.

Inner fences and outer fences: The values (Q1 – 1.5 IQR, Q3 + 1.5 IQR) are known as inner fences. The values (Q1 – 3 IQR, Q3 + 3 IQR) are known as outer fences.

Five-number summary: Minimum, Q1, M, Q3, Maximum.

CHAPTER 4
CENTRAL TENDENCY

The word 'average' denotes a representative of a whole set of observations. It is a single figure which describes the entire series of observations with their varying sizes. It is a typical value occupying a central position, where some observations are larger and some others are smaller than it. Average is a general term which describes the centre of a series. The values of a variable tend to concentrate around the central value. Averages describe the central part of the distribution and are therefore also called measures of central tendency.

Characteristics of Central Tendency:
1. It should be rigidly defined:
(a) An average should be properly defined so that it has one and only one interpretation.
(b) The average should not depend on the personal prejudice and bias of the investigator.
2. It should be based on all items: The average should depend on each and every item of the series, so that if any item is dropped, the average itself is altered.
3. It should be easily understood: A desirable property of an average is that it can be readily understood; only then can it become popular.
4. It should not be unduly affected by extreme values:
(a) Since the average depends on each and every item, we must ensure that no extreme observation unduly influences the central value.
(b) An extreme observation can change or distort the central value, so that it is no longer typical of the group of values.
5.
It should be least affected by the fluctuations of sampling: If we select different samples, we should expect approximately the same central value in each sample.
6. It should be easy to interpret: An average can become popular only if it is easy to compute.
7. It should be easily subjected to further mathematical calculations: An average is preferred to others if it can be used for further statistical computation.

Measures of Central Tendency: The most common measures of central tendency are:
1. Mean or Arithmetic mean
2. Median
3. Mode.

Arithmetic Mean (A.M.): It is obtained by summing up all the observations and dividing the total by the number of observations.

(a) Arithmetic mean for ungrouped data or individual observations: If $x_1, x_2, x_3, \ldots, x_n$ are $n$ observations of a variable $x$, the arithmetic mean $\bar{x}$ is given by

$\bar{x} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Steps of Calculation:
1. Add together all the values of $x$ to get $\sum x$.
2. Divide this total by the number of observations.

Example: The following table gives the marks obtained in Statistics by 10 students of B. Com (Hons) in Serampore College.

Roll No               1    2    3    4    5    6    7    8    9    10
Marks in Statistics   67   69   66   68   72   63   76   65   70   74

Calculate the arithmetic mean of marks in Statistics among these students.

Solution:

Roll No.   Marks
1          67
2          69
3          66
4          68
5          72
6          63
7          76
8          65
9          70
10         74
N = 10     $\sum x = 690$

$\bar{x} = \dfrac{\sum x}{N}$

Here $\sum x = 690$ and $N = 10$, so

$\bar{x} = \dfrac{690}{10} = 69$

Thus the arithmetic mean of marks obtained in Statistics by the students is 69.

(b) Arithmetic mean for grouped data (discrete series): Let a variable take $n$ values $x_1, x_2, x_3, \ldots, x_n$ having corresponding frequencies $f_1, f_2, f_3, \ldots, f_n$; then

$\bar{x} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \dfrac{1}{N} \sum f x$, where $N = \sum f$ (the sum of all frequencies).

Steps:
1. Multiply each value of the variable by its corresponding frequency and sum the products to get $\sum f_i x_i$.
2. Divide this total $\sum f_i x_i$ by the number of observations, i.e. the total frequency $\sum f_i$.
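The two formulas above can be checked with a small Python sketch. The function names are assumptions introduced for this illustration; the ungrouped marks are those of the worked example above, while the grouped values and frequencies are hypothetical.

```python
def mean_ungrouped(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def mean_grouped(values, freqs):
    """Arithmetic mean of a discrete series: sum of f*x divided by sum of f."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f


# Marks in Statistics of the 10 students from the worked example.
marks = [67, 69, 66, 68, 72, 63, 76, 65, 70, 74]
print(mean_ungrouped(marks))

# Hypothetical discrete series (illustrative only).
x_values = [1, 2, 3]
f_values = [2, 3, 5]
print(mean_grouped(x_values, f_values))
```

The first call reproduces the mean of the worked example; the second divides the sum of the products f·x by the total frequency, following steps 1 and 2.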
Example: Find the arithmetic mean from the frequency table.

Marks             30   40   50   60   70   80   90
No. of students   15   20   10   15   20   15   5

Solution:

Marks (x)   Number of students (f)   fx
30          15                       450
40          20                       800
50          10                       500
60          15                       900
70          20                       1400
80          15