Organisation Of Data PDF
Summary
This document is about data organization and classification. It explains how to classify data for further statistical analysis, and how to prepare frequency distribution tables, form classes and differentiate between univariate and bivariate frequency distributions.
Full Transcript
CHAPTER
Organisation of Data
(Statistics for Economics, 2024-25)

Studying this chapter should enable you to:
- classify the data for further statistical analysis;
- distinguish between quantitative and qualitative classification;
- prepare a frequency distribution table;
- know the technique of forming classes;
- be familiar with the method of tally marking;
- differentiate between univariate and bivariate frequency distributions.

1. INTRODUCTION

In the previous chapter you learnt how data are collected. You also came to know the difference between census and sampling. In this chapter, you will learn how the data that you collected are to be classified. The purpose of classifying raw data is to bring order to them so that they can easily be subjected to further statistical analysis.

Have you ever observed your local junk dealer, or kabadiwallah, to whom you sell old newspapers, broken household items, empty glass bottles, plastics, etc.? He purchases these things from you and sells them to those who recycle them. But with so much junk in his shop, it would be very difficult for him to manage his trade if he had not organised it properly. To ease his situation he suitably groups, or “classifies”, the various kinds of junk. He puts old newspapers together and ties them with a rope. Then he collects all the empty glass bottles in a sack. He heaps the metal articles in one corner of his shop and sorts them into groups like “iron”, “copper”, “aluminium”, “brass”, and so on. In this way he groups his junk into different classes (“newspapers”, “plastics”, “glass”, “metals”, etc.) and brings order to them. Once his junk is arranged and classified, it becomes easier for him to find a particular item that a buyer may demand.

Likewise, when you arrange your schoolbooks in a certain order, it becomes easier for you to handle them. You may classify them according to subjects, where each subject becomes a group or a class. So, when you need a particular book on history, for instance, all you need to do is search for that book in the group “History”. Otherwise, you would have to search through your entire collection to find the particular book you are looking for.

While classification of objects or things saves our valuable time and effort, it is not done in an arbitrary manner. The kabadiwallah groups his junk according to the markets for reused goods. For example, under the group “Glass” he would put empty bottles, broken mirrors, windowpanes, etc. Similarly, when you classify your history books under the group “History”, you would not put a book of a different subject in that group. Otherwise the entire purpose of grouping would be lost. Classification, therefore, is arranging or organising things into groups or classes based on some criteria.

Activity
Visit your local post office to find out how letters are sorted. Do you know what the pin-code in a letter indicates? Ask your postman.

2. RAW DATA

Like the kabadiwallah’s junk, unclassified data, or raw data, are highly disorganised. They are often very large and cumbersome to handle. Drawing meaningful conclusions from them is a tedious task because they do not yield to statistical methods easily. Therefore, proper organisation and presentation of such data is needed before any systematic statistical analysis is undertaken. Hence, after collecting data, the next step is to organise and present them in a classified form.

Suppose you want to know the performance of students in mathematics and you have collected data on the marks in mathematics of 100 students of your school. If you present them as a table, they may appear something like Table 3.1.
TABLE 3.1
Marks in Mathematics Obtained by 100 Students in an Examination

47  45  10  60  51  56  66  100  49  40
60  59  56  55  62  48  59   55  51  41
42  69  64  66  50  59  57   65  62  50
64  30  37  75  17  56  20   14  55  90
62  51  55  14  25  34  90   49  56  54
70  47  49  82  40  82  60   85  65  66
49  44  64  69  70  48  12   28  55  65
49  40  25  41  71  80   0   56  14  22
66  53  46  70  43  61  59   12  30  35
45  44  57  76  82  39  32   14  90  25

Or you could have collected data on the monthly expenditure on food of 50 households in your neighbourhood to know their average expenditure on food. The data collected, in that case, had you presented them as a table, would have resembled Table 3.2. Both Tables 3.1 and 3.2 are raw or unclassified data. In both tables the numbers are not arranged in any order.

TABLE 3.2
Monthly Household Expenditure (in Rupees) on Food of 50 Households

1904  1559  3473  1735  2760
2041  1612  1753  1855  4439
5090  1085  1823  2346  1523
1211  1360  1110  2152  1183
1218  1315  1105  2628  2712
4248  1812  1264  1183  1171
1007  1180  1953  1137  2048
2025  1583  1324  2621  3676
1397  1832  1962  2177  2575
1293  1365  1146  3222  1396

Now if you are asked for the highest marks in mathematics from Table 3.1, you have to first arrange the marks of 100 students in either ascending or descending order. That is a tedious task. It becomes more tedious if, instead of 100, you have the marks of 1,000 students to handle. Similarly, in Table 3.2, you would note that it is difficult to ascertain the average monthly expenditure of 50 households. And this difficulty will go up manifold if the number were larger, say 5,000 households. Like our kabadiwallah, who would be distressed to find a particular item when his junk becomes large and disarranged, you would face a similar situation when you try to get any information from raw data that are large. In short, it is a tedious task to pull information from large unclassified data.

Raw data are summarised and made comprehensible by classification. When facts of similar characteristics are placed in the same class, it enables one to locate them easily, make comparisons, and draw inferences without any difficulty. You have studied in Chapter 2 that the Government of India conducts a Census of population every ten years. About 20 crore persons were contacted in Census 2001. The raw data of the census are so large and fragmented that it appears an almost impossible task to draw any meaningful conclusion from them. But when the same data are classified according to gender, education, marital status, occupation, etc., the structure and nature of the population of India is easily understood.

Raw data consist of observations on variables. The raw data given in Tables 3.1 and 3.2 consist of observations on a specific variable or group of variables. Look at Table 3.1, for instance, which contains the marks in mathematics scored by 100 students. How can we make sense of these marks? The mathematics teacher looking at these marks would be thinking: How have my students done? How many have not passed? How we classify the data depends upon the purpose we have in mind. In this case, the teacher wishes to understand in some depth how these students have done. She would probably choose to construct a frequency distribution. This is discussed in the next section.

Activity
Collect data on the total weekly expenditure of your family for a year and arrange them in a table. See how many observations you have. Arrange the data monthly and find the number of observations.

3. CLASSIFICATION OF DATA

The groups or classes of a classification can be formed in various ways. Instead of classifying your books according to subjects (“History”, “Geography”, “Mathematics”, “Science”, etc.), you could have classified them author-wise in alphabetical order. Or you could have classified them according to the year of publication. The way you classify them would depend on your requirement.

Likewise, raw data are classified in various ways depending on the purpose. They can be grouped according to time. Such a classification is known as a Chronological Classification. In such a classification, data are classified either in ascending or in descending order with reference to time such as years, quarters, months, weeks, etc. The following example shows the population of India classified in terms of years. The variable ‘population’ is a Time Series as it depicts a series of values for different years.

Example 1
Population of India (in crores)

Year    Population (Crores)
1951     35.7
1961     43.8
1971     54.6
1981     68.4
1991     81.8
2001    102.7
2011    121.0

In Spatial Classification the data are classified with reference to geographical locations such as countries, states, cities, districts, etc. Example 2 shows the yield of wheat in different countries.

Example 2
Yield of Wheat for Different Countries (2013)

Country     Yield of wheat (kg/hectare)
Canada      3594
China       5055
France      7254
Germany     7998
India       3154
Pakistan    2787

Source: Indian Agricultural Statistics at a Glance, 2015

Activities
- In Example 1, find out the years in which India’s population was minimum and maximum.
- In Example 2, find the country whose yield of wheat is slightly more than that of India. How much would that be in terms of percentage?
- Arrange the countries of Example 2 in ascending order of yield. Do the same exercise for descending order of yield.

Sometimes you come across characteristics that cannot be expressed quantitatively. Such characteristics are called Qualities or Attributes. Examples are nationality, literacy, religion, gender, marital status, etc. They cannot be measured. Yet these attributes can be classified on the basis of either the presence or the absence of a qualitative characteristic. Such a classification of data on attributes is called a Qualitative Classification. In the following example, the population of a country is grouped on the basis of the qualitative variable “gender”. An observation could be either a male or a female. These two characteristics could be further classified on the basis of marital status, as given below:

Example 3

Population
    Male
        Married
        Unmarried
    Female
        Married
        Unmarried

The classification at the first stage is based on the presence or absence of an attribute, i.e., male or not male (female). At the second stage, each class (male and female) is further subdivided on the basis of the presence or absence of another attribute, i.e., whether married or unmarried.

Characteristics like height, weight, age, income, marks of students, etc., are quantitative in nature. When the collected data of such characteristics are grouped into classes, it becomes a Quantitative Classification.

Activity
The objects around us can be grouped as either living or non-living. Is it a quantitative classification?

Example 4
Frequency Distribution of Marks in Mathematics of 100 Students

Marks     Frequency
0–10          1
10–20         8
20–30         6
30–40         7
40–50        21
50–60        23
60–70        19
70–80         6
80–90         5
90–100        4
Total       100

Example 4 shows the quantitative classification of the marks in mathematics of 100 students given in Table 3.1.

Activity
- Express the values of frequency of Example 4 as a proportion or percentage of total frequency. Note that frequency expressed in this way is known as relative frequency.
- In Example 4, which class has the maximum concentration of data? Express it as a percentage of total observations. Which class has the minimum concentration of data?

4. VARIABLES: CONTINUOUS AND DISCRETE

A simple definition of variable, which you have read in the last chapter, does not tell you how it varies. Variables differ on the basis of specific criteria. They are broadly classified into two types:
(i) Continuous and
(ii) Discrete.

A continuous variable can take any numerical value. It may take integral values (1, 2, 3, 4, ...), fractional values (1/2, 2/3, 3/4, ...), and values that are not exact fractions (√2 = 1.414, √3 = 1.732, …, √7 = 2.645). For example, the height of a student, as he/she grows say from 90 cm to 150 cm, would take all the values in between them. It can take values that are whole numbers like 90 cm, 100 cm, 108 cm, 150 cm. It can also take fractional values like 90.85 cm, 102.34 cm, 149.99 cm, etc., that are not whole numbers. Thus the variable “height” is capable of manifesting in every conceivable value and its values can also be broken down into infinite gradations. Other examples of a continuous variable are weight, time, distance, etc.

Unlike a continuous variable, a discrete variable can take only certain values. Its value changes only by finite “jumps”. It “jumps” from one value to another but does not take any intermediate value between them. For example, a variable like the “number of students in a class”, for different classes, would assume values that are only whole numbers. It cannot take any fractional value like 0.5 because “half of a student” is absurd. Therefore it cannot take a value like 25.5 between 25 and 26. Instead its value could have been either 25 or 26. What we observe is that as its value changes from 25 to 26, the values in between them (the fractions) are not taken by it. But we should not have the impression that a discrete variable cannot take any fractional value. Suppose X is a variable that takes values like 1/8, 1/16, 1/32, 1/64, ... Is it a discrete variable? Yes, because though X takes fractional values it cannot take any value between two adjacent fractional values. It changes or “jumps” from 1/8 to 1/16 and from 1/16 to 1/32. But it cannot take a value in between 1/8 and 1/16 or between 1/16 and 1/32.

Activity
Distinguish the following variables as continuous and discrete: area, volume, temperature, number appearing on a dice, crop yield, population, rainfall, number of cars on road and age.

Example 4 shows how the marks of 100 students are grouped into classes. You will be wondering how we got it from the raw data of Table 3.1. But, before we address this question, you must know what a frequency distribution is.

5. WHAT IS A FREQUENCY DISTRIBUTION?

A frequency distribution is a comprehensive way to classify raw data of a quantitative variable. It shows how different values of a variable (here, the marks in mathematics scored by a student) are distributed in different classes along with their corresponding class frequencies. In this case we have ten classes of marks: 0–10, 10–20, …, 90–100. The term Class Frequency means the number of values in a particular class. For example, in the class 30–40 we find 7 values of marks from the raw data in Table 3.1. They are 30, 37, 34, 30, 35, 39, 32. The frequency of the class 30–40 is thus 7. But you might be wondering why 40, which occurs thrice in the raw data, is not included in the class 30–40. Had it been included, the class frequency of 30–40 would have been 10 instead of 7. The puzzle will become clear if you are patient enough to read this chapter carefully. So carry on. You will find the answer yourself.

Each class in a frequency distribution table is bounded by Class Limits. Class limits are the two ends of a class. The lowest value is called the Lower Class Limit and the highest value the Upper Class Limit. For example, the class limits for the class 60–70 are 60 and 70. Its lower class limit is 60 and its upper class limit is 70. Class Interval or Class Width is the difference between the upper class limit and the lower class limit. For the class 60–70, the class interval is 10 (upper class limit minus lower class limit).

The Class Mid-Point or Class Mark is the middle value of a class.
The class mark lies halfway between the lower class limit and the upper class limit of a class and can be ascertained in the following manner:

Class Mid-Point or Class Mark = (Upper Class Limit + Lower Class Limit)/2 ..... (1)

The class mark or mid-value of each class is used to represent the class. Once raw data are grouped into classes, individual observations are not used in further calculations. Instead, the class mark is used.

TABLE 3.3
The Lower Class Limits, the Upper Class Limits and the Class Mark

Class     Frequency   Lower Class Limit   Upper Class Limit   Class Mark
0–10          1               0                  10                5
10–20         8              10                  20               15
20–30         6              20                  30               25
30–40         7              30                  40               35
40–50        21              40                  50               45
50–60        23              50                  60               55
60–70        19              60                  70               65
70–80         6              70                  80               75
80–90         5              80                  90               85
90–100        4              90                 100               95

Frequency Curve is a graphic representation of a frequency distribution. Fig. 3.1 shows the diagrammatic presentation of the frequency distribution of the data in our example above. To obtain the frequency curve we plot the class marks on the X-axis and frequency on the Y-axis.

Fig. 3.1: Diagrammatic Presentation of Frequency Distribution of Data.

How to prepare a Frequency Distribution?

While preparing a frequency distribution, the following five questions need to be addressed:
1. Should we have equal or unequal sized class intervals?
2. How many classes should we have?
3. What should be the size of each class?
4. How should we determine the class limits?
5. How should we get the frequency for each class?

Should we have equal or unequal sized class intervals?

There are two situations in which unequal sized intervals are used. First, when we have data on income and other similar variables where the range is very high. For example, income per day may range from nearly zero to many hundred crores of rupees. In such a situation, equal class intervals are not suitable because (i) if the class intervals are of moderate size and equal, there would be a large number of classes; (ii) if class intervals are large, we would tend to suppress information on either very small levels or very high levels of income.

Second, if a large number of values are concentrated in a small part of the range, equal class intervals would lead to lack of information on many values.

In all other cases, equal sized class intervals are used in frequency distributions.

How many classes should we have?

The number of classes is usually between six and fifteen. In case we are using equal sized class intervals, the number of classes can be calculated by dividing the range (the difference between the largest and the smallest values of the variable) by the size of the class intervals.

Activities
Find the range of the following:
- population of India in Example 1,
- yield of wheat in Example 2.

What should be the size of each class?

The answer to this question depends on the answer to the previous question. Given the range of the variable, we can determine the number of classes once we decide the class interval. Thus, we find that these two decisions are interlinked. We cannot decide on one without deciding on the other.

In Example 4, we have the number of classes as 10. Given the value of range as 100, the class intervals are automatically 10. Note that in the present context we have chosen class intervals that are equal in magnitude. However, we could have chosen class intervals that are not of equal magnitude. In that case, the classes would have been of unequal width.

How should we determine the class limits?

Class limits should be definite and clearly stated. Generally, open-ended classes such as “70 and over” or “less than 10” are not desirable.

The lower and upper class limits should be determined in such a manner that frequencies of each class tend to concentrate in the middle of the class intervals.

Class intervals are of two types:
(i) Inclusive class intervals: In this case, values equal to the lower and upper limits of a class are included in the frequency of that same class.
(ii) Exclusive class intervals: In this case, an item equal to either the upper or the lower class limit is excluded from the frequency of that class.

In the case of discrete variables, both exclusive and inclusive class intervals can be used.

In the case of continuous variables, inclusive class intervals are used very often.

Examples

Suppose we have data on marks obtained by students in a test and all the marks are in full numbers (fractional marks are not allowed). Suppose the marks obtained by the students vary from 0 to 100. This is a case of a discrete variable since fractional marks are not allowed. In this case, if we are using equal sized class intervals and decide to have 10 class intervals, then the class intervals can take either of the following forms:

Inclusive form of class intervals:
0–10
11–20
21–30
-
-
91–100

Exclusive form of class intervals:
0–10
10–20
20–30
-
-
90–100

In the case of exclusive class intervals, we have to decide in advance what is to be done if we get a value equal to the value of a class limit. For example, we could decide that values such as 10, 30, etc., should be put into the class intervals “0 to 10” and “20 to 30” respectively. This can be called the case of lower limit excluded.

Or else we could put the values 10, 30, etc., into the class intervals “10 to 20” and “30 to 40” respectively. This can be called the case of upper limit excluded.

Example of Continuous Variable

Suppose we have data on a variable such as height (centimetres) or weight (kilograms). This data is of the continuous type. In such cases the class intervals may be defined in the following manner:

30 Kg - 39.999... Kg
40 Kg - 49.999... Kg
50 Kg - 59.999... Kg, etc.

These class intervals are understood in the following manner:

30 Kg and above and under 40 Kg
40 Kg and above and under 50 Kg
50 Kg and above and under 60 Kg, etc.

TABLE 3.4
Frequency Distribution of Incomes of 550 Employees of a Company

Income (Rs)    Number of Employees
800–899               50
900–999              100
1000–1099            200
1100–1199            150
1200–1299             40
1300–1399             10
Total                550

Adjustment in Class Interval

A close observation of the Inclusive Method in Table 3.4 would show that though the variable “income” is a continuous variable, no such continuity is maintained when the classes are made. We find a “gap” or discontinuity between the upper limit of a class and the lower limit of the next class. For example, between the upper limit of the first class, 899, and the lower limit of the second class, 900, we find a “gap” of 1. Then how do we ensure the continuity of the variable while classifying data? This is achieved by making an adjustment in the class interval. The adjustment is done in the following way:

1. Find the difference between the lower limit of the second class and the upper limit of the first class. For example, in Table 3.4 the lower limit of the second class is 900 and the upper limit of the first class is 899. The difference between them is 1, i.e. (900 – 899 = 1).
2. Divide the difference obtained in (1) by two, i.e. (1/2 = 0.5).
3. Subtract the value obtained in (2) from the lower limits of all classes (lower class limit – 0.5).
4. Add the value obtained in (2) to the upper limits of all classes (upper class limit + 0.5).

After the adjustment that restores continuity of data in the frequency distribution, Table 3.4 is modified into Table 3.5.

After the adjustments in class limits, the equality (1) that determines the value of the class mark would be modified as follows:

Adjusted Class Mark = (Adjusted Upper Class Limit + Adjusted Lower Class Limit)/2.

TABLE 3.5
Frequency Distribution of Incomes of 550 Employees of a Company

Income (Rs)        Number of Employees
799.5–899.5               50
899.5–999.5              100
999.5–1099.5             200
1099.5–1199.5            150
1199.5–1299.5             40
1299.5–1399.5             10
Total                    550

How should we get the frequency for each class?

In simple terms, the frequency of an observation means how many times that observation occurs in the raw data. In our Table 3.1, we observe that the value 40 occurs thrice; 0 and 10 occur only once; 49 occurs five times and so on. Thus the frequency of 40 is 3, 0 is 1, 10 is 1, 49 is 5 and so on. But when the data are grouped into classes as in Example 4, the Class Frequency refers to the number of values in a particular class. The counting of class frequency is done by tally marks against the particular class.

Finding class frequency by tally marking

A tally (/) is put against a class for each student whose marks are included in that class. For example, if the marks obtained by a student are 57, we put a tally (/) against class 50–60. If the marks are 71, a tally is put against the class 70–80. If someone obtains 40 marks, a tally is put against the class 40–50. Table 3.6 shows the tally marking of marks of 100 students in mathematics from Table 3.1.

The counting of tallies is made easier when four of them are put as //// and the fifth tally is placed across them, shown here as ////\. Tallies are then counted as groups of five. So if there are 16 tallies in a class, we put them as ////\ ////\ ////\ / for the sake of convenience. Thus frequency in a class is equal to the number of tallies against that class.

TABLE 3.6
Tally Marking of Marks of 100 Students in Mathematics

Class | Observations | Tally Mark | Frequency | Class Mark
0–10 | 0 | / | 1 | 5
10–20 | 10, 14, 17, 12, 14, 12, 14, 14 | ////\ /// | 8 | 15
20–30 | 25, 25, 20, 22, 25, 28 | ////\ / | 6 | 25
30–40 | 30, 37, 34, 39, 32, 30, 35 | ////\ // | 7 | 35
40–50 | 47, 42, 49, 49, 45, 45, 47, 44, 40, 44, 49, 46, 41, 40, 43, 48, 48, 49, 49, 40, 41 | ////\ ////\ ////\ ////\ / | 21 | 45
50–60 | 59, 51, 53, 56, 55, 57, 55, 51, 50, 56, 59, 56, 59, 57, 59, 55, 56, 51, 55, 56, 55, 50, 54 | ////\ ////\ ////\ ////\ /// | 23 | 55
60–70 | 60, 64, 62, 66, 69, 64, 64, 60, 66, 69, 62, 61, 66, 60, 65, 62, 65, 66, 65 | ////\ ////\ ////\ //// | 19 | 65
70–80 | 70, 75, 70, 76, 70, 71 | ////\ / | 6 | 75
80–90 | 82, 82, 82, 80, 85 | ////\ | 5 | 85
90–100 | 90, 100, 90, 90 | //// | 4 | 95
Total | | | 100 |

Loss of Information

The classification of data as a frequency distribution has an inherent shortcoming. While it summarises the raw data, making it concise and comprehensible, it does not show the details that are found in raw data. There is a loss of information in classifying raw data, though much is gained by summarising it as classified data. Once the data are grouped into classes, an individual observation has no significance in further statistical calculations. In Example 4, the class 20–30 contains 6 observations: 25, 25, 20, 22, 25 and 28. So when these data are grouped as a class 20–30 in the frequency distribution, the latter provides only the number of records in that class (i.e. frequency = 6) but not their actual values. All values in this class are assumed to be equal to the middle value of the class interval or class mark (i.e. 25). Further statistical calculations are based only on the values of the class mark and not on the values of the observations in that class. This is true for other classes as well. Thus the use of the class mark instead of the actual values of the observations in statistical methods involves considerable loss of information. However, being able to make more sense of the raw data more than makes up for this.

Frequency distribution with unequal classes

By now you are familiar with frequency distributions of equal class intervals. You know how they are constructed out of raw data. But in some cases frequency distributions with unequal class intervals are more appropriate. If you observe the frequency distribution of Example 4, as in Table 3.6, you will notice that most of the observations are concentrated in classes 40–50, 50–60 and 60–70. Their respective frequencies are 21, 23 and 19. It means that out of 100 students, 63 (21+23+19) students are concentrated in these classes. Thus, 63 per cent are in the middle range of 40–70. The remaining 37 per cent of data are in classes 0–10, 10–20, 20–30, 30–40, 70–80, 80–90 and 90–100. These classes are sparsely populated with observations. Further, you will also notice that observations in these classes deviate more from their respective class marks than those in other classes. But if classes are to be formed in such a way that class marks coincide, as far as possible, with a value around which the observations in a class tend to concentrate, then unequal class intervals are more appropriate.

TABLE 3.7
Frequency Distribution of Unequal Classes

Class | Observations | Frequency | Class Mark
0–10 | 0 | 1 | 5
10–20 | 10, 14, 17, 12, 14, 12, 14, 14 | 8 | 15
20–30 | 25, 25, 20, 22, 25, 28 | 6 | 25
30–40 | 30, 37, 34, 39, 32, 30, 35 | 7 | 35
40–45 | 42, 44, 40, 44, 41, 40, 43, 40, 41 | 9 | 42.5
45–50 | 47, 49, 49, 45, 45, 47, 49, 46, 48, 48, 49, 49 | 12 | 47.5
50–55 | 51, 53, 51, 50, 51, 50, 54 | 7 | 52.5
55–60 | 59, 56, 55, 57, 55, 56, 59, 56, 59, 57, 59, 55, 56, 55, 56, 55 | 16 | 57.5
60–65 | 60, 64, 62, 64, 64, 60, 62, 61, 60, 62 | 10 | 62.5
65–70 | 66, 69, 66, 69, 66, 65, 65, 66, 65 | 9 | 67.5
70–80 | 70, 75, 70, 76, 70, 71 | 6 | 75
80–90 | 82, 82, 82, 80, 85 | 5 | 85
90–100 | 90, 100, 90, 90 | 4 | 95
Total | | 100 |

Table 3.7 shows the same frequency distribution as Table 3.6 in terms of unequal classes. Each of the classes 40–50, 50–60 and 60–70 is split into two: class 40–50 is divided into 40–45 and 45–50, class 50–60 into 50–55 and 55–60, and class 60–70 into 60–65 and 65–70. The new classes 40–45, 45–50, 50–55, 55–60, 60–65 and 65–70 have a class interval of 5. The other classes (0–10, 10–20, 20–30, 30–40, 70–80, 80–90 and 90–100) retain their old class interval of 10. The last column of this table shows the new values of class marks for these classes. Compare them with the old values of class marks in Table 3.6. Notice that the observations in these classes deviated more from their old class mark values than from their new class mark values. Thus the new class mark values are more representative of the data in these classes than the old values.

Figure 3.2 shows the frequency curve of the distribution in Table 3.7. The class marks of the table are plotted on the X-axis and the frequencies are plotted on the Y-axis.

Fig. 3.2: Frequency Curve

Activity
If you compare Figure 3.2 with Figure 3.1, what do you observe? Do you find any difference between them? Can you explain the difference?

Frequency array

So far we have discussed the classification of data for a continuous variable using the example of percentage marks of 100 students in mathematics. For a discrete variable, the classification of its data is known as a Frequency Array. Since a discrete variable takes values and not intermediate fractional values between two integral values, we have frequencies that correspond to each of its integral values.

The example in Table 3.8 illustrates a Frequency Array.

TABLE 3.8
Frequency Array of the Size of Households

Size of the Household    Number of Households
1                                 5
2                                15
3                                25
4                                35
5                                10
6                                 5
7                                 3
8                                 2
Total                           100

The variable “size of the household” is a discrete variable that only takes integral values as shown in the table.

6. BIVARIATE FREQUENCY DISTRIBUTION

Very often when we take a sample from a population we collect more than one type of information from each element of the sample. For example, suppose we have taken a sample of 20 companies from the list of companies based in a city. Suppose that we collect information on sales and expenditure on advertisements from each company. In this case, we have bivariate sample data. Such bivariate data can be summarised using a Bivariate Frequency Distribution.

A Bivariate Frequency Distribution can be defined as the frequency distribution of two variables.

Table 3.9 shows the frequency distribution of two variables, sales (in Rs lakh) and advertisement expenditure (in Rs thousand), of 20 companies. The values of sales are classed in different columns and the values of advertisement expenditure are classed in different rows. Each cell shows the frequency of the corresponding row and column values. For example, there are 3 firms whose sales are between Rs 135 lakh and Rs 145 lakh and whose advertisement expenditures are between Rs 64 thousand and Rs 66 thousand. The use of a bivariate distribution would be taken up in Chapter 8 on correlation.

TABLE 3.9
Bivariate Frequency Distribution of Sales (in Lakh Rs) and Advertisement Expenditure (in Thousand Rs) of 20 Firms

         115–125  125–135  135–145  145–155  155–165  165–175  Total
62–64       2        1                                            3
64–66       1                 3                                   4
66–68       1        1        2        1                          5
68–70                2                 2                          4
70–72                1        1                 1        1        4
Total       4        5        6        3        1        1       20

7. CONCLUSION

The data collected from primary and secondary sources are raw or unclassified. Once the data are collected, the next step is to classify them for further statistical analysis. Classification brings order to the data. The chapter enables you to know how data can be classified through frequency distribution in a comprehensive manner. Once you know the techniques of classification, it will be easy for you to construct a frequency distribution, both for continuous and discrete variables.

Recap
- Classification brings order to raw data.
- A Frequency Distribution shows how the different values of a variable are distributed in different classes along with their corresponding class frequencies.
- Either the upper class limit or the lower class limit is excluded in the Exclusive Method.
- Both the upper and the lower class limits are included in the Inclusive Method.
In a Frequency Distribution, further statistical calculations are based only on the class mark values, instead of the values of the observations.
The classes should be formed in such a way that the class mark of each class comes as close as possible to a value around which the observations in a class tend to concentrate.

EXERCISES

1. Which of the following alternatives is true?
   (i) The class midpoint is equal to:
       (a) The average of the upper class limit and the lower class limit.
       (b) The product of the upper class limit and the lower class limit.
       (c) The ratio of the upper class limit and the lower class limit.
       (d) None of the above.
   (ii) The frequency distribution of two variables is known as
       (a) Univariate Distribution
       (b) Bivariate Distribution
       (c) Multivariate Distribution
       (d) None of the above
   (iii) Statistical calculations in classified data are based on
       (a) the actual values of observations
       (b) the upper class limits
       (c) the lower class limits
       (d) the class midpoints
   (iv) Range is the
       (a) difference between the largest and the smallest observations
       (b) difference between the smallest and the largest observations
       (c) average of the largest and the smallest observations
       (d) ratio of the largest to the smallest observation
2. Can there be any advantage in classifying things? Explain with an example from your daily life.
3. What is a variable? Distinguish between a discrete and a continuous variable.
4. Explain the ‘exclusive’ and ‘inclusive’ methods used in classification of data.
5. Use the data in Table 3.2 that relate to monthly household expenditure (in Rs) on food of 50 households and
   (i) Obtain the range of monthly household expenditure on food.
   (ii) Divide the range into an appropriate number of class intervals and obtain the frequency distribution of expenditure.
   (iii) Find the number of households whose monthly expenditure on food is
       (a) less than Rs 2000
       (b) more than Rs 3000
       (c) between Rs 1500 and Rs 2500
6. In a city 45 families were surveyed for the number of cell phones they used. Prepare a frequency array based on their replies as recorded below.
   1 3 2 2 2 2 1 2 1 2 2 3 3 3 3
   3 3 2 3 2 2 6 1 6 2 1 5 1 5 3
   2 4 2 7 4 2 4 3 4 2 0 3 1 4 3
7. What is ‘loss of information’ in classified data?
8. Do you agree that classified data is better than raw data? Why?
9. Distinguish between univariate and bivariate frequency distribution.
10. Prepare a frequency distribution by the inclusive method, taking a class interval of 7, from the following data.
   28 17 15 22 29 21 23 27 18 12
   7 2 9 4 1 8 3 10 5 20
   16 12 8 4 33 27 21 15 3 36
   27 18 9 2 4 6 32 31 29 18
   14 13 15 11 9 7 1 5 37 32
   28 26 24 20 19 25 19 20 6 9
11. “The quick brown fox jumps over the lazy dog”
   Examine the above sentence carefully and note the number of letters in each word. Treating the number of letters as a variable, prepare a frequency array for this data.

Suggested Activity

From your old mark-sheets find the marks that you obtained in mathematics in the previous classes' half-yearly or annual examinations. Arrange them year-wise. Check whether the marks you have secured in the subject are a variable or not. Also see if, over the years, you have improved in mathematics.
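As a rough illustration of the frequency array and the bivariate frequency distribution discussed in this chapter, the counting can be sketched in Python. This is only a sketch under stated assumptions: the household sizes and the (sales, advertisement expenditure) pairs below are made-up values, not the data behind Table 3.8 or Table 3.9, and the function names `frequency_array`, `class_of` and `bivariate_frequency` are hypothetical. Classes are formed by the exclusive method, with the upper class limit excluded.

```python
from collections import Counter


def frequency_array(values):
    """Count how often each value of a discrete variable occurs."""
    counts = Counter(values)
    # Return (value, frequency) pairs sorted by value.
    return sorted(counts.items())


def class_of(value, lower, width):
    """Return the class (lo, hi) containing value; hi is excluded (exclusive method)."""
    index = int((value - lower) // width)
    lo = lower + index * width
    return (lo, lo + width)


def bivariate_frequency(pairs, x_lower, x_width, y_lower, y_width):
    """Count the observations falling in each (x class, y class) cell."""
    cells = Counter()
    for x, y in pairs:
        cell = (class_of(x, x_lower, x_width), class_of(y, y_lower, y_width))
        cells[cell] += 1
    return dict(cells)


# Hypothetical household sizes (illustrative only, not the Table 3.8 data).
household_sizes = [3, 4, 2, 4, 1, 3, 4, 5, 2, 4]
array = frequency_array(household_sizes)
for size, frequency in array:
    print(size, frequency)

# Hypothetical (sales in Rs lakh, advertisement in Rs thousand) pairs.
firms = [(118, 62.5), (127, 63.0), (136, 64.5), (139, 65.0), (145, 66.0)]
table = bivariate_frequency(firms, x_lower=115, x_width=10, y_lower=62, y_width=2)
for (sales_class, adv_class), frequency in sorted(table.items()):
    print(sales_class, adv_class, frequency)
```

In this sketch a sale of exactly 145 is placed in the class 145–155 rather than 135–145, which is what excluding the upper class limit means. The frequencies in each result add up to the number of observations, as they do in the Total row of a frequency table.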