PLN115 Quantitative Methods for Planning: Lectures 1 and 2

By one definition, statistics consist of facts and figures such as the average annual snowfall in Denver or Derek Jeter’s lifetime batting average. These statistics are usually informative and time-saving because they condense large quantities of information into a few simple figures. More specifically, we also use the term statistics to refer to a general field of mathematics. In that case, the term is a shortened version of statistical procedures.

Research involves gathering information. When researchers finish the task of gathering information, they typically find themselves with pages and pages of measurements such as preferences, personality scores, opinions, and so on. Here, we look at statistics that researchers use to analyze and interpret the information that they gather.

Specifically, statistics serve two general purposes:
1. Statistics are used to organize and summarize the information so that the researcher can see what happened in the research study and can communicate the results to others.
2. Statistics help the researcher to answer the questions that initiated the research by determining exactly what general conclusions are justified based on the specific results that were obtained.

… typically begins with a general question about a specific situation or a group (or groups) of individuals. For example?

As you can well imagine, a population can be quite large, for example, the entire set of women on the planet Earth. A researcher might be more specific, limiting the population for study to women who are registered voters in the United States. Perhaps the investigator would like to study the population consisting of women who are heads of state. Populations can obviously vary in size from extremely large to very small, depending on how the investigator defines the population. The population being studied should always be identified by the investigator.

Because populations tend to be very large, it usually is impossible for a researcher to examine every individual in the population of interest. Therefore, investigators typically select a smaller, more manageable group from the population and limit their studies to the individuals in the selected group. In statistical terms, a set of individuals selected from a population is called a sample. A sample is intended to be representative of its population, and a sample should always be identified in terms of the population from which it was selected.

Typically, researchers are interested in specific characteristics of the individuals in the population (or in the sample), or they are interested in outside factors that may influence the individuals. For example, a researcher may be interested in the influence of the weather on people’s moods. As the weather changes, do people’s moods also change? Something that can change or have different values is called a variable.

Variables can be characteristics that differ from one individual to another, such as height, weight, gender, or personality. Also, variables can be environmental conditions that change, such as temperature, time of day, or the size of the room in which the research is being conducted. To demonstrate changes in variables, it is necessary to make measurements of the variables being examined. The measurement obtained for each individual is called a datum or, more commonly, a score or raw score. The complete set of scores is called the data set or simply the data.

Descriptive statistics are techniques that take raw scores and organize or summarize them in a form that is more manageable. Often the scores are organized in a table or a graph so that it is possible to see the entire set of scores. Another common technique is to summarize a set of scores by computing an average.
Note that even if the data set has hundreds of scores, the average provides a single descriptive value for the entire set.

Inferential statistics are methods that use sample data to make general statements about a population. By analyzing the results from the sample, we hope to make general statements about the population. Typically, researchers use sample statistics as the basis for drawing conclusions about population parameters.

One problem with using samples, however, is that a sample provides only limited information about the population. Although samples are generally representative of their populations, a sample is not expected to give a perfectly accurate picture of the whole population. There usually is some discrepancy between a sample statistic and the corresponding population parameter. This discrepancy is called sampling error, and it creates the fundamental problem inferential statistics must always address.

The scores that make up the data from a research study are the result of observing and measuring variables. For example, a researcher may finish a study with a set of IQ scores, personality scores, or reaction-time scores. Some variables, such as height, weight, and eye color, are well-defined, concrete entities that can be observed and measured directly. On the other hand, many variables studied by behavioral scientists are internal characteristics that people use to help describe and explain behavior. For example, we say that a student does well in school because he or she is intelligent. Or we say that someone is anxious in social situations, or that someone seems to be hungry. Variables like intelligence, anxiety, and hunger are called constructs, and because they are intangible and cannot be directly observed, they are often called hypothetical constructs.
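
To make the earlier point about sampling error concrete, the following minimal sketch compares a sample statistic with the corresponding population parameter. The scores and the choice of sample are hypothetical values made up for illustration; they do not come from any study discussed in these lectures.

```python
from statistics import mean

# Hypothetical population of 10 scores (illustrative values only).
population = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]

# A sample: a smaller set of individuals selected from that population.
sample = [8, 5, 9, 6]

population_mean = mean(population)  # population parameter
sample_mean = mean(sample)          # sample statistic

# Sampling error: the discrepancy between the statistic and the parameter.
sampling_error = sample_mean - population_mean

print(f"Population mean: {population_mean}")
print(f"Sample mean:     {sample_mean}")
print(f"Sampling error:  {sampling_error}")
```

A different sample from the same population would generally give a different sample mean, and therefore a different sampling error.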

Although constructs such as intelligence are internal characteristics that cannot be directly observed, it is possible to observe and measure behaviors that are representative of the construct. For example, we cannot “see” intelligence but we can see examples of intelligent behavior. The external behaviors can then be used to create an operational definition for the construct. An operational definition defines a construct in terms of external behaviors that can be observed and measured. For example, your intelligence is measured and defined by your performance on an IQ test, or hunger can be measured and defined by the number of hours since last eating.

The variables in a study can be characterized by the type of values that can be assigned to them.

A discrete variable consists of separate, indivisible categories. For this type of variable, there are no intermediate values between two adjacent categories. Consider the values displayed when dice are rolled. Between neighboring values (for example, seven dots and eight dots) no other values can ever be observed. Discrete variables are commonly restricted to whole, countable numbers, for example, the number of children in a family or the number of students attending class. If you observe class attendance from day to day, you may count 18 students one day and 19 students the next day. However, it is impossible ever to observe a value between 18 and 19. A discrete variable may also consist of observations that differ qualitatively. For example, people can be classified by gender (male or female) or by occupation (nurse, teacher, lawyer, etc.), and college students can be classified by academic major (art, biology, chemistry, etc.). In each case, the variable is discrete because it consists of separate, indivisible categories.

On the other hand, many variables are not discrete. Variables such as time, height, and weight are not limited to a fixed set of separate, indivisible categories.
You can measure time, for example, in hours, minutes, seconds, or fractions of seconds. These variables are called continuous because they can be divided into an infinite number of fractional parts. Suppose, for example, that a researcher is measuring weights for a group of individuals participating in a diet study. Because weight is a continuous variable, it can be pictured as a continuous line. Note that there are an infinite number of possible points on the line without any gaps or separations between neighboring points. For any two different points on the line, it is always possible to find a third value that is between the two points.

Two other factors apply to continuous variables:
1. When measuring a continuous variable, it should be very rare to obtain identical measurements for two different individuals. Because a continuous variable has an infinite number of possible values, it should be almost impossible for two people to have exactly the same score. If the data show a substantial number of tied scores, then you should suspect that the measurement procedure is very crude or that the variable is not really continuous.
2. When measuring a continuous variable, each measurement category is actually an interval that must be defined by boundaries. For example, two people who both claim to weigh 150 pounds are probably not exactly the same weight. However, they are both around 150 pounds. One person may actually weigh 149.6 and the other 150.3. Thus, a score of 150 is not a specific point on the scale but instead is an interval. To differentiate a score of 150 from a score of 149 or 151, we must set up boundaries on the scale of measurement. These boundaries are called real limits and are positioned exactly halfway between adjacent scores. Thus, a score of X = 150 pounds is actually an interval bounded by a lower real limit of 149.5 at the bottom and an upper real limit of 150.5 at the top.
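
The real limits just described can be computed directly. This is a minimal sketch: the function names and the `unit` parameter (the spacing between adjacent scores, here 1 pound) are illustrative assumptions, and how a measurement falling exactly on a limit is handled is left open because the text does not say.

```python
def real_limits(score, unit=1.0):
    """Return (lower, upper) real limits, placed halfway to the adjacent scores."""
    half = unit / 2
    return score - half, score + half


def within_real_limits(measurement, score, unit=1.0):
    """True if the measurement lies strictly between the real limits of the score."""
    lower, upper = real_limits(score, unit)
    return lower < measurement < upper


lower, upper = real_limits(150)
print(lower, upper)

# The two weights from the text both belong to the interval for X = 150.
print(within_real_limits(149.6, 150))
print(within_real_limits(150.3, 150))
```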
Any individual whose weight falls between these real limits will be assigned a score of X = 150.

Data collection requires that we make measurements of our observations. Measurement involves assigning individuals or events to categories. The categories can simply be names such as male/female or employed/unemployed, or they can be numerical values such as 68 inches or 175 pounds. The categories used to measure a variable make up a scale of measurement, and the relationships between the categories determine different types of scales. The distinctions among the scales are important because they identify the limitations of certain types of measurements and because certain statistical procedures are appropriate for scores that have been measured on some scales but not on others. If you were interested in people’s heights, for example, you could measure a group of individuals by simply classifying them into three categories: tall, medium, and short. However, this simple classification would not tell you much about the actual heights of the individuals, and these measurements would not give you enough information to calculate an average height for the group. Although the simple classification would be adequate for some purposes, you would need more sophisticated measurements before you could answer more detailed questions. We examine four different scales of measurement, beginning with the simplest and moving to the most sophisticated.

The word nominal means “having to do with names.” Measurement on a nominal scale involves classifying individuals into categories that have different names but are not related to each other in any systematic way. For example, if you were measuring the academic majors for a group of college students, the categories would be art, biology, business, chemistry, and so on. Each student would be classified in one category according to his or her major.
The measurements from a nominal scale allow us to determine whether two individuals are different, but they do not identify either the direction or the size of the difference. If one student is an art major and another is a biology major, we can say that they are different, but we cannot say that art is “more than” or “less than” biology and we cannot specify how much difference there is between art and biology. Other examples of nominal scales include classifying people by race, gender, or occupation.

Although the categories on a nominal scale are not quantitative values, they are occasionally represented by numbers. For example, the rooms or offices in a building may be identified by numbers. You should realize that the room numbers are simply names and do not reflect any quantitative information. Room 109 is not necessarily bigger than Room 100 and certainly not 9 points bigger. It also is fairly common to use numerical values as a code for nominal categories when data are entered into computer programs. For example, the data from a survey may code males with a 0 and females with a 1. Again, the numerical values are simply names and do not represent any quantitative difference. The scales that follow do reflect an attempt to make quantitative distinctions.

The categories that make up an ordinal scale not only have different names (as in a nominal scale) but also are organized in a fixed order corresponding to differences of magnitude. Often, an ordinal scale consists of a series of ranks (first, second, third, and so on) like the order of finish in a horse race. Occasionally, the categories are identified by verbal labels like small, medium, and large drink sizes at a fast-food restaurant. In either case, the fact that the categories form an ordered sequence means that there is a directional relationship between categories.
With measurements from an ordinal scale, you can determine whether two individuals are different and you can determine the direction of difference.

However, ordinal measurements do not allow you to determine the size of the difference between two individuals. In a NASCAR race, for example, the first-place car finished faster than the second-place car, but the ranks don’t tell you how much faster. Other examples of ordinal scales include socioeconomic class (upper, middle, lower) and T-shirt sizes (small, medium, large). In addition, ordinal scales are often used to measure variables for which it is difficult to assign numerical scores. For example, people can rank their food preferences but might have trouble explaining “how much” they prefer chocolate ice cream to steak.

Both an interval scale and a ratio scale consist of a series of ordered categories (like an ordinal scale) with the additional requirement that the categories form a series of intervals that are all exactly the same size. Thus, the scale of measurement consists of a series of equal intervals, such as inches on a ruler. Other examples of interval and ratio scales are the measurement of time in seconds, weight in pounds, and temperature in degrees Fahrenheit. Note that, in each case, one interval (1 inch, 1 second, 1 pound, 1 degree) is the same size, no matter where it is located on the scale.

The factor that differentiates an interval scale from a ratio scale is the nature of the zero point. An interval scale has an arbitrary zero point. That is, the value 0 is assigned to a particular location on the scale simply as a matter of convenience or reference. In particular, a value of zero does not indicate a total absence of the variable being measured. For example, a temperature of 0° Fahrenheit does not mean that there is no temperature, and it does not prohibit the temperature from going even lower. Interval scales with an arbitrary zero point are relatively rare.
The two most common examples are the Fahrenheit and Celsius temperature scales. Other examples include golf scores (above and below par) and relative measures such as above and below average rainfall.

A ratio scale is anchored by a zero point that is not arbitrary but rather is a meaningful value representing none (a complete absence) of the variable being measured. The existence of an absolute, non-arbitrary zero point means that we can measure the absolute amount of the variable; that is, we can measure the distance from 0. This makes it possible to compare measurements in terms of ratios. For example, a gas tank with 10 gallons (10 more than 0) has twice as much gas as a tank with only 5 gallons (5 more than 0). Also note that a completely empty tank has 0 gallons. With a ratio scale, we can measure the direction and the size of the difference between two measurements and we can describe the difference in terms of a ratio. Ratio scales are quite common and include physical measures such as height and weight, as well as variables such as reaction time or the number of errors on a test.

Scales of measurement are important because they help determine the statistics that are used to evaluate the data. Specifically, there are certain statistical procedures that are used with numerical scores from interval or ratio scales and other statistical procedures that are used with non-numerical scores from nominal or ordinal scales. The distinction is based on the fact that numerical scores are compatible with basic arithmetic operations (adding, multiplying, and so on) but non-numerical scores are not. For example, if you measure IQ scores for a group of students, it is possible to add the scores together to find a total and then calculate the average score for the group. On the other hand, if you measure the academic major for each student, you cannot add the scores to obtain a total.
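
The contrast between numerical and non-numerical scores can be sketched in a few lines of Python. The IQ scores and majors below are hypothetical; counting categories and reporting the most frequent one are shown only as examples of summaries that remain meaningful for nominal data, not as the only available techniques.

```python
from collections import Counter
from statistics import mean, mode

# Hypothetical numerical scores (interval/ratio scale): arithmetic is meaningful.
iq_scores = [95, 110, 102, 121, 102]
total_iq = sum(iq_scores)
average_iq = mean(iq_scores)

# Hypothetical nominal scores: categories can be counted but not added.
majors = ["art", "biology", "art", "chemistry", "art"]
major_counts = Counter(majors)    # frequency of each category
most_common_major = mode(majors)  # the most frequent category

print(total_iq, average_iq)
print(dict(major_counts), most_common_major)
```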
For most statistical applications, the distinction between an interval scale and a ratio scale is not important because both scales produce numerical values that permit us to compute differences between scores, to add scores, and to calculate mean scores. On the other hand, measurements from nominal or ordinal scales are typically not numerical values, do not measure distance, and are not compatible with many basic arithmetic operations. Therefore, alternative statistical techniques are necessary for data from nominal or ordinal scales of measurement.

Measuring a variable in a research study yields a value or a score for each individual. Raw scores are the original, unchanged scores obtained in the study. Scores for a particular variable are typically represented by the letter X. The letter N is used to specify how many scores are in a set. An uppercase letter N identifies the number of scores in a population and a lowercase letter n identifies the number of scores in a sample.

Many of the computations required in statistics involve adding a set of scores. Because this procedure is used so frequently, a special notation is used to refer to the sum of a set of scores. The Greek letter sigma, or Σ, is used to stand for summation.

Order of Mathematical Operations
1. Any calculation contained within parentheses is done first.
2. Squaring (or raising to other exponents) is done second.
3. Multiplying and/or dividing is done third. A series of multiplication and/or division operations should be done in order from left to right.
4. Summation using the Σ notation is done next.
5. Finally, any other addition and/or subtraction is done.

Statistics are a set of numerical data, so the phenomenon under study must be capable of quantitative measurement. The raw material of statistics always originates from the operation of counting (enumeration) or measurement.
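
The order of operations listed above for Σ notation can be checked with a short sketch. The four scores are hypothetical; the point is only that expressions which look similar give different results depending on whether squaring or subtraction happens before or after the summation.

```python
# Hypothetical set of scores X.
scores = [3, 1, 7, 4]

# ΣX: add the scores.
sum_x = sum(scores)

# ΣX²: square each score first (step 2), then sum (step 4).
sum_of_squares = sum(x ** 2 for x in scores)

# (ΣX)²: the parentheses force the summation first (step 1), then squaring.
square_of_sum = sum(scores) ** 2

# Σ(X - 1): subtract inside the parentheses first, then sum.
sum_of_x_minus_1 = sum(x - 1 for x in scores)

# ΣX - 1: sum first (step 4), then subtract (step 5).
sum_x_minus_1 = sum(scores) - 1

print(sum_x, sum_of_squares, square_of_sum, sum_of_x_minus_1, sum_x_minus_1)
```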

Some key terms:
Investigator: the person who conducts the statistical enquiry, i.e., counts or measures the characteristics under study for further statistical analysis.
Respondent: a person from whom the information is collected.
Statistical unit: an item on which the measurements are taken.
Collection of statistical data: the process of counting, enumeration, or measurement together with the systematic recording of results.

Before we embark upon the collection of data for a given statistical enquiry, it is imperative to examine carefully the following points, which may be termed preliminaries to data collection:
i. Objectives and scope of the enquiry.
ii. Statistical units to be used.
iii. Sources of information (data).
iv. Method of data collection.
v. Degree of accuracy aimed at in the final results.
vi. Type of enquiry.

A well-defined and identifiable object or a group of objects with which the measurements or counts in any statistical investigation are associated is called a statistical unit. For example, in a socio-economic survey the unit may be an individual person, a family, a household, or a block of a locality. A very important step before the collection of data begins is to define clearly the statistical units on which the data are to be collected. In a number of situations the units are conventionally fixed, like the physical units of measurement such as metres, kilometres, kilograms, quintals, hours, days, weeks, etc., which are well defined and do not need any elaboration or explanation. However, in many statistical investigations, particularly those relating to socio-economic studies, arbitrary units are used which must be clearly defined.
This is imperative because, in the absence of a clear-cut and precise definition of the statistical units, serious errors may be committed in the data collection: we may collect irrelevant data on items that should have been excluded and omit data on items that should have been included. This will ultimately lead to fallacious conclusions.

Primary data: data collected originally by the investigator for the given enquiry.
Secondary data: data used by the investigator that had been collected earlier by someone else.

If primary data are to be collected, a decision has to be taken whether (i) the census method or (ii) the sample technique is to be used for data collection. In the census method, we resort to 100% inspection of the population and enumerate each and every unit of the population. In the sample technique, we inspect or study only a selected, representative, and adequate fraction (finite subset) of the population, and after analysing the results of the sample data we draw conclusions about the characteristics of the population.

The statistical enquiries may be of different types as outlined below:
(i) Official, Semi-official or Un-official.
(ii) Initial or Repetitive.
(iii) Confidential or Non-confidential.
(iv) Direct or Indirect.
(v) Regular or Ad-hoc.
(vi) Census or Sample.
(vii) Primary or Secondary.

Methods of collecting primary data:
(i) Direct personal investigation - collection of data personally by the investigator (organising agency) from the sources concerned.
(ii) Indirect oral interviews - In these types of enquiries, factual data on different problems are collected by interviewing persons who are directly or indirectly concerned with the subject matter of the enquiry and who are in possession of the requisite information. The method also consists in collection of the data through enumerators appointed for this purpose.
(iii) Information received through local agencies - This method consists in the appointment of local agents (commonly called correspondents) by the investigator in different parts of the field of enquiry. These correspondents or agencies in different regions collect the information in their own ways and according to their own judgement, and then submit their reports periodically to the central or head office, where the data are processed for final analysis.
(iv) Mailed questionnaire method - This method consists in preparing a questionnaire (a list of questions relating to the field of enquiry, with space for the answers to be filled in by the respondents), which is mailed to the respondents with a request for a quick response within the specified time.
(v) Schedules sent through enumerators - In this method the enumerators go to the respondents personally with the schedule (list of questions), ask them the questions therein, and record their replies.

Note:
Questionnaire - a list of questions which are answered by the respondent in his or her own handwriting.
Schedule - a form for obtaining answers to the questions which is filled in by the interviewers or enumerators (the field agents who put these questions) in a face-to-face situation with the respondents.

The chief sources of secondary data may be broadly classified into the following two groups:
1. Published Sources. There are a number of national (government, semi-government and private) organisations and also international agencies which collect statistical data relating to various fields and publish their findings in statistical reports on a regular basis (monthly, quarterly, annually, ad-hoc).
Official publications of Central government
Publications of semi-government statistical organizations
Publications of research institutions
Publications of commercial and financial institutions
Reports of various committees and commissions appointed by the government
Newspapers and periodicals
International publications
2. Unpublished Sources. The statistical data need not always be published. There are various sources of unpublished statistical material, such as the records maintained by private firms or business enterprises who may not like to release their data to any outside agency; the various departments and offices of the Central and State Governments; and the research carried out by individual research scholars in the universities or research institutes.

The questionnaire should be designed or drafted with utmost care and caution so that all the relevant and essential information for the enquiry may be collected without any difficulty, ambiguity, or vagueness. Drafting a good questionnaire is a highly specialised job and requires great care, skill, wisdom, efficiency, and experience. The following general points may be borne in mind:
1. The size of the questionnaire should be as small as possible.
2. The questions should be clear, brief, unambiguous, non-offending, courteous in tone, corroborative in nature, and to the point, so that not much scope for guessing is left on the part of the respondents.
3. The questions should be arranged in a natural logical sequence.
4. The usage of vague and ‘multiple meaning’ words should be avoided.
5. Questions should be so designed that they are readily comprehensible and easy to answer for the respondents.
6. Questions of a sensitive and personal nature should be avoided.
7. Types of Questions. The questions in the questionnaire may be broadly classified as follows:
(a) Shut Questions.
In such questions possible answers are suggested by the framers of the questionnaire and the respondent is required to tick one of them. Shut questions can further be sub-divided into the following forms:
(i) Simple Alternate Questions. In such questions, the respondent has to choose between two clear-cut alternatives like ‘Yes or No’; ‘Right or Wrong’; ‘Either, Or’ and so on. For instance, ‘Do you own a refrigerator?’ Yes or No. Such questions are also called dichotomous questions. This technique can be applied with elegance to situations where two clear-cut alternatives exist.
(ii) Multiple Choice Questions. Quite often, it is not possible to define a clear-cut alternative, and accordingly in such a situation either the first method (Alternate Questions) is not used or additional answers between Yes and No, like Do not know, No opinion, Occasionally, Casually, Seldom, etc., are added.
(b) Open Questions. Open questions are those in which no alternative answers are suggested and the respondents are at liberty to express their frank and independent opinions on the problem in their own words. For instance, ‘What are the drawbacks in our examination system?’; ‘What solution do you suggest to the housing problem in Delhi?’; ‘Which programme on Delhi TV do you like best?’ are some open questions. Since the views of the respondents in open questions might differ widely, it is very difficult to tabulate the diverse opinions and responses.
8. Leading questions should be avoided.
9. Cross Checks. The questionnaire should be so designed as to provide internal checks on the accuracy of the information supplied by the respondents by including some connected questions, at least with respect to matters which are fundamental to the enquiry.
10. Pre-testing the Questionnaire.
From a practical point of view it is desirable to try out the questionnaire on a small scale (i.e., on a small cross-section of the population for which the enquiry is intended) before using it for the given enquiry on a large scale.
11. A Covering Letter. A covering letter from the organisers of the enquiry should be enclosed along with the questionnaire.
12. The mode of tabulation and analysis, viz., hand-operated, machine tabulation, or computerisation, should also be kept in mind while designing the questionnaire.
13. The questionnaire should be made attractive by proper layout and appealing get-up.

The presentation of the data is broadly classified into the following two categories:
(i) Tabular Presentation.
(ii) Diagrammatic or Graphic Presentation.
A statistical table is an orderly and logical arrangement of data into rows and columns, and it attempts to present the voluminous and heterogeneous data in a condensed and homogeneous form. The process of arranging the data into groups or classes according to resemblances and similarities is technically called classification. Classification of the data is preliminary to its tabulation. Diagrams and graphs are pictorial devices for presenting the statistical data.

Classification is the arrangement of the data into different classes, which are to be determined depending upon the nature, objectives, and scope of the enquiry. For instance, the number of students registered in a college during a particular academic year may be classified on the basis of any of the following criteria:
(i) Sex
(ii) Age
(iii) The state to which they belong
(iv) Religion
(v) Different faculties, like Arts, Science, Humanities, Law, Commerce, etc.
(vi) Heights or weights
(vii) Institutions (Colleges) and so on.
The facts in one class will differ from those of another class with respect to some characteristic called the basis or criterion of classification.
The technique of dividing the given data into different classes with respect to more than one basis simultaneously is called cross-classification.

Functions of classification: it condenses the data; facilitates comparison; helps to study the relationships; and facilitates the statistical treatment of the data.

Rules of classification:
It should be unambiguous. The classes should be rigidly defined so that they do not lead to any ambiguity. In other words, there should not be any room for doubt or confusion regarding the placement of the observations in the given classes.
It should be exhaustive and mutually exclusive. The classification must be exhaustive in the sense that each and every item in the data must belong to one of the classes. A good classification should be free from residual classes like ‘others’ or ‘miscellaneous’ because such classes do not reveal the characteristics of the data completely. Further, the various classes should be mutually disjoint or non-overlapping so that an observed value belongs to one and only one of the classes.
It should be stable. In order to have meaningful comparisons of the results, an ideal classification must be stable, i.e., the same pattern of classification should be adopted throughout the analysis and also for further enquiries on the same subject.
It should be suitable for the purpose. The classification must be in keeping with the objectives of the enquiry.
It should be flexible. A good classification should be flexible in that it should be adjustable to new and changed situations and conditions.

Bases of Classification: The bases or the criteria with respect to which the data are classified primarily depend on the objectives and the purpose of the enquiry.
Generally, the data can be classified on the following four bases:
(i) Geographical, i.e., area-wise or regional
(ii) Chronological, i.e., with respect to occurrence of time
(iii) Qualitative, i.e., with respect to some character or attribute
(iv) Quantitative, i.e., with respect to numerical values or magnitudes

The organisation of the data pertaining to a quantitative phenomenon involves the following four stages:
(i) The set or series of individual observations - unorganised (raw) or organised (arrayed) data.
(ii) Discrete or ungrouped frequency distribution.
(iii) Grouped frequency distribution.
(iv) Continuous frequency distribution.

[Table: raw or disorganised data]

1. Array
A better presentation of the above raw data would be to arrange them in an ascending or descending order of magnitude, which is called the ‘arraying’ of the data. However, this presentation (arraying), though better than the raw data, does not reduce the volume of the data.

2. Discrete or Ungrouped Frequency Distribution
A much better way of representing the data is to express it in the form of a discrete or ungrouped frequency distribution, where we count the number of times each value of the variable occurs. This is facilitated through the technique of Tally Marks or Tally Bars.

3. Grouped Frequency Distribution
If the identity of the units about whom a particular piece of information is collected is not relevant, nor is the order in which the observations occur, then the first real step of condensation consists in classifying the data into different classes (or class intervals) by dividing the entire range of the values of the variable into a suitable number of groups called classes and then recording the number of observations in each group (or class). The various groups into which the values of the variable are classified are known as classes or class intervals; the length of the class interval is called the width or magnitude of the classes.
The two values specifying the class are called the class limits; the larger value is called the upper class limit and the smaller value is called the lower class limit.

4. Continuous Frequency Distribution
While dealing with a continuous variable, it is not desirable to present the data in a grouped frequency distribution whose classes leave gaps between them. For example, if we consider the ages of a group of students in a school, then we form continuous class intervals (without any gaps) of the following type:
Age in years:
Below 6
6 or more but less than 9
9 or more but less than 12
12 or more but less than 15
and so on, which takes care of all the students, including those whose ages are fractional.

1. Types of Classes
The classes should be clearly defined and should not lead to any ambiguity. Further, they should be exhaustive and mutually exclusive (i.e., non-overlapping) so that any value of the variable corresponds to one and only one of the classes. In other words, every value of the variable is assigned to exactly one class.

2. Number of Classes
Although no hard and fast rule exists, the choice of the number of classes (class intervals) into which a given frequency distribution is divided depends primarily upon:
(i) The total frequency (i.e., the total number of observations in the distribution),
(ii) The nature of the data, i.e., the size or magnitude of the values of the variable,
(iii) The accuracy aimed at, and
(iv) The ease of computation of the various descriptive measures of the frequency distribution, such as mean, variance, etc., for further processing of the data.
From a practical point of view, the number of classes should be neither too small nor too large. If too few classes are used, the classification becomes very broad and rough, in the sense that too many frequencies will be concentrated or crowded in a single class. This might obscure some important features and characteristics of the data, thereby resulting in loss of information.
Further, a larger grouping error may occur. Too many classes, on the other hand, will result in too few frequencies in each class. Moreover, a large number of classes will render the distribution too unwieldy to handle, thus defeating the very purpose of classification.

A number of rules of thumb have been proposed for calculating the proper number of classes. However, an elegant, though approximate, formula is the one given by Prof. Sturges, known as Sturges’ rule: $k = 1 + 3.322 \log_{10} N$, where $k$ is the number of classes and $N$ is the total frequency.

The number of class intervals should be such that they usually give a uniform and unimodal distribution, in the sense that the frequencies in the given classes first increase steadily, reach a maximum and then decrease steadily. There should not be any sudden jumps or falls, which result in a so-called irregular distribution. The maximum frequency should not occur at the very beginning or at the end of the distribution, nor should it be repeated; in either case we get an irregular distribution. The number of classes should be a whole number, preferably 5 or some multiple of 5, viz., 10, 15, 20, 25, etc., which are readily perceptible to the mind and are quite convenient for numerical computations in the further processing (statistical analysis) of the data. Uncommon figures like 3, 7, 11, etc., should be avoided as far as possible.

3. Size of class intervals
The size of the class interval is inversely proportional to the number of classes in a given distribution. From the above discussion it follows that the choice of the size of the class interval will also depend largely on the sound subjective judgement of the statistician, keeping in mind other considerations such as N (the total frequency), the nature of the data, the accuracy of the results and computational ease for further processing of the data.
Here an approximate value of the magnitude (or width) of the class interval, say $i$, can be obtained by using Sturges’ rule, which gives: $i = \dfrac{\text{Range}}{1 + 3.322 \log_{10} N}$, where Range is the difference between the largest and the smallest observation and $N$ is the total frequency. Another ‘rule of thumb’ for determining the size of the class interval is: “The length of the class interval should not be greater than 1/4th of the estimated population standard deviation.”

Like the number of classes, the size of the class intervals should also, as far as possible, be taken as 5 or some multiple of 5, viz., 10, 15, 20, etc., to facilitate computation of the various descriptive measures of the frequency distribution like mean ($\bar{x}$), standard deviation ($\sigma$), etc. Class intervals should be so fixed that each class has a convenient mid-point about which all the observations in the class cluster or concentrate. This amounts to saying that the entire frequency of the class is concentrated at the mid-value of the class. This assumption will be true only if the frequencies of the different classes are uniformly distributed in the respective class intervals. It is a very fundamental assumption in statistical theory for the computation of various statistical measures like mean, standard deviation, etc.

From the point of view of practical convenience, it is desirable, as far as possible, to take class intervals of equal or uniform magnitude throughout the frequency distribution. This will lead to ease of further computation and representation. Sometimes, however, it may be neither practicable nor desirable to keep the magnitudes of the class intervals equal, for instance if there are very wide gaps in the observed data.

4. Types of Class Intervals
(a) Inclusive Type Classes. Classes of the type 30-39, 40-49, 50-59, 60-69, etc., in which both the upper and lower limits are included in the class, are called ‘inclusive classes’. For instance, the class interval 40-49 includes all the values from 40 to 49, both inclusive.
The next value, viz., 50, is included in the next class 50-59, and so on. However, the fractional values between 49 and 50 cannot be accounted for in such a classification. Hence, the ‘inclusive type’ of classification may be used for a grouped frequency distribution of discrete variables like marks in a test, number of accidents on the road, etc., where the variable takes only integral values. It cannot be used with advantage for the frequency distribution of continuous variables like age, height, weight, etc., where all values (integral as well as fractional) are permissible.

(b) Exclusive Type Classes. Classes in which the upper limits are excluded from the respective classes and are included in the immediately following class are termed ‘exclusive classes’. For example, with the classes 30-40 and 40-50, the value 40 belongs to the class 40-50.

Class boundaries. If in a grouped frequency distribution there are gaps between the upper limit of any class and the lower limit of the succeeding class (as in the case of the inclusive type of classification), the data need to be converted into a continuous distribution by applying a correction for continuity to determine new classes of the exclusive type. The upper and lower class limits of the new ‘exclusive type’ classes are called class boundaries.

Mid-value or class mark. The mid-value or class mark is the value of the variable which is exactly at the middle of the class. The mid-value of any class is obtained by dividing the sum of the upper and lower class limits (or class boundaries) by 2.

(c) Open End Classes. The classification is termed an ‘open end classification’ if the lower limit of the first class or the upper limit of the last class is not specified. Such classes, in which one of the limits is missing, are called ‘open end classes’.
For example, classes like marks less than 20, age above 60 years, salary not exceeding Rupees 100 or salaries over Rupees 200 are ‘open end classes’, since one of the class limits (lower or upper) is not specified in them. As far as possible, open end classes should be avoided, since the mid-value or class mark of such a class cannot be accurately obtained, and this poses problems in the computation of various statistical measures for further processing of the data. Open end classes also present problems in the graphic presentation of the data. However, the use of open end classes is unavoidable in a number of practical situations, particularly for economic and medical data, where there are a few observations with extremely small or large values while most of the other observations are concentrated in a narrower range.
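
The steps discussed in this section (estimating the number and width of classes with Sturges’ rule, tallying a grouped frequency distribution with exclusive classes, converting inclusive classes to class boundaries, and computing mid-values) can be sketched in Python. This is only an illustration and not part of the lecture: the marks are made-up data, all function names are hypothetical, Sturges’ rule is assumed in its usual form k = 1 + 3.322 log10 N, and the boundary conversion assumes the same gap between every pair of adjacent classes.

```python
import math

# Hypothetical marks of 20 students (made-up data, for illustration only).
marks = [22, 45, 51, 38, 67, 49, 55, 40, 61, 36,
         58, 44, 27, 53, 39, 62, 41, 43, 56, 48]


def sturges_classes(n):
    """Approximate number of classes by Sturges' rule: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def sturges_width(values):
    """Approximate class width: i = range / (1 + 3.322 * log10(N))."""
    return (max(values) - min(values)) / sturges_classes(len(values))


def grouped_frequencies(values, lower, width, k):
    """Tally values into k exclusive classes starting at `lower`.

    The upper limit of each class is excluded and belongs to the next class.
    Returns a list of (lower limit, upper limit, frequency) tuples.
    """
    counts = [0] * k
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < k:
            raise ValueError(f"{v} falls outside the classes (not exhaustive)")
        counts[idx] += 1
    return [(lower + j * width, lower + (j + 1) * width, counts[j])
            for j in range(k)]


def class_boundaries(inclusive_classes):
    """Apply the correction for continuity to inclusive classes like (30, 39).

    Half of the gap between one upper limit and the next lower limit is
    subtracted from every lower limit and added to every upper limit.
    Assumes the gap is the same between all adjacent classes.
    """
    gap = inclusive_classes[1][0] - inclusive_classes[0][1]
    return [(lo - gap / 2, hi + gap / 2) for lo, hi in inclusive_classes]


def mid_value(lower, upper):
    """Mid-value (class mark): half the sum of the two class limits."""
    return (lower + upper) / 2


# Sturges' rule gives the starting estimates; following the advice to prefer
# multiples of 5, the width is then rounded to 10, with classes starting at 20.
print(round(sturges_classes(len(marks)), 2), round(sturges_width(marks), 2))
for lo, hi, f in grouped_frequencies(marks, lower=20, width=10, k=5):
    print(f"{lo} and under {hi}: frequency {f}, mid-value {mid_value(lo, hi)}")
print(class_boundaries([(30, 39), (40, 49), (50, 59)]))
```

Rounding the Sturges estimates to a width of 10 and five classes is a judgement call of the kind the text describes; a different starting point or width would give a different, equally valid table, provided the classes remain exhaustive and mutually exclusive.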
