Business Statistics AIMA 1st Year PDF
All India Management Association
P N Mishra and R S Bhardwaj
Summary
This document is study material for a first-year business statistics course offered by the AIMA. It covers a range of topics, including basic statistical measures, graphic representation of data, probability, sampling, hypothesis testing, and time series analysis. It's an introductory-to-intermediate level text for a management-focused audience.
Business Statistics
Study Material for GM-03

ALL INDIA MANAGEMENT ASSOCIATION
CENTRE FOR MANAGEMENT EDUCATION

SIM DEVELOPMENT TEAM
Members: Dr. Ganesh Singh, Prof. Anuja Pandey, Prof. Amit Bhatnagar, Prof. Gurbandini Kaur, Prof. R K Singh, Prof. Ritesh Saxena, Prof. Sarah Nasim
Author: P N Mishra and R S Bhardwaj
Produced by Excel Books Private Limited for AIMA-CME, Management House, 14 Institutional Area, Lodhi Road, New Delhi-110003

CONTENTS

Unit 1 Basic Statistics with Summary Measures and Graphic Representation of Data
1.1 Statistics: An Introduction
1.2 Statistical Series
1.3 Construction of a Frequency Distribution
1.4 Bivariate and Multivariate Frequency Distributions
1.5 Graphic Presentation of Data
1.6 Time Series Graphs or Historigrams
1.7 Logarithmic Graphs or Ratio Charts
1.8 Graph of a Frequency Distribution
1.9 Measures of Central Tendency
1.10 Various Measures of Average
1.11 Arithmetic Mean
1.12 Median
1.13 Other Partition or Positional Measures
1.14 Mode
1.15 Relation between Mean, Median and Mode
1.16 Measure of Dispersion
1.17 Moments, Skewness and Kurtosis
1.18 Moments
1.19 Skewness
1.20 Kurtosis
1.21 Summary
1.22 Keywords
1.23 Review Questions
1.24 Further Readings

Unit 2 Basics of Probability
2.1 Introduction
2.2 Classical Definition of Probability
2.3 Counting Techniques
2.4 Statistical or Empirical Definition of Probability
2.5 Axiomatic or Modern Approach to Probability
2.6 Important Theorems
2.7 Bayes Theorem
2.8 Probability Distribution
2.9 Binomial Distribution
2.10 Uniform Distribution (Discrete Random Variable)
2.11 Poisson Distribution
2.12 Uniform Distribution (Continuous Variable)
2.13 Normal Distribution
2.14 Summary
2.15 Keywords
2.16 Review Questions
2.17 Further Readings

Unit 3 Sampling Methods and Statistical Distribution
3.1 Introduction
3.2 Sample Investigation
3.3 Sampling
3.4 Random (or Probability) Sampling Methods
3.5 Non-Random Sampling Methods
3.6 Theoretical Basis of Sampling
3.7 Central Limit Theorem
3.8 Summary
3.9 Keywords
3.10 Review Questions
3.11 Further Readings

Unit 4 Hypothesis Testing
4.1 Introduction
4.2 Test of Hypothesis
4.3 Tests of Hypothesis Concerning Mean
4.4 Tests of Hypothesis Concerning Proportion
4.5 Chi-Square (χ²) Distribution
4.6 Applications of χ² Test
4.7 Summary
4.8 Keywords
4.9 Review Questions
4.10 Further Readings

Unit 5 Analysis of Variance
5.1 Introduction
5.1 Example where Anova is Applicable
5.2 Principle Governing Such a Procedure
5.3 Assumptions for Anova
5.4 Procedure for Conducting the Test
5.5 Interpretation of the Anova Table
5.6 Summary
5.7 Review Exercises
5.8 Further Readings

Unit 6 Correlation, Linear Regression and Time Series Analysis
6.1 Introduction
6.2 Forecasting
6.3 Conceptual Model
6.4 Mathematical Model
6.5 Algorithms and Applications
6.6 Regression Algorithm
6.7 Simple Linear Regression Analysis
6.8 Time Series Analysis
6.9 Component of a Time Series
6.10 Time Series Forecasting Methods
6.11 The Mean Absolute Deviation (MAD)
6.12 Mean Squared Error (MSE)
6.13 Seasonal Variations
6.14 Summary
6.15 Keywords
6.16 Review Questions
6.17 Further Readings

UNIT 1 BASIC STATISTICS WITH SUMMARY MEASURES AND GRAPHIC REPRESENTATION OF DATA

LEARNING OBJECTIVES
After studying this unit, you would be able to:
Understand statistical series
Construct frequency distributions and graphs
Analyse measure of central tendency

UNIT STRUCTURE
1.1 Statistics: An Introduction
1.2 Statistical Series
1.3 Construction of a Frequency Distribution
1.4 Bivariate and Multivariate Frequency Distributions
1.5 Graphic Presentation of Data
1.6 Time Series Graphs or Historigrams
1.7 Logarithmic Graphs or Ratio Charts
1.8 Graph of a Frequency Distribution
1.9 Measures of Central Tendency
1.10 Various Measures of Average
1.11 Arithmetic Mean
1.12 Median
1.13 Other Partition or Positional Measures
1.14 Mode
1.15 Relation between Mean, Median and Mode
1.16 Measure of Dispersion
1.17 Moments, Skewness and Kurtosis
1.18 Moments
1.19 Skewness
1.20 Kurtosis
1.21 Summary
1.22 Keywords
1.23 Review Questions
1.24 Further Readings

1.1 STATISTICS: AN INTRODUCTION

Origin and Growth of Statistics

Statistics, as a subject, has a very long history. The origin of STATISTICS is indicated by the word itself which seems to have been derived either from the Latin word 'STATUS' or from the Italian word 'STATISTA' or may be from the German word 'STATISTIK.' The meaning of all these words is 'political state'. Every State administration in the past collected and analysed data. The data regarding population gave an idea about the possible military strength and the data regarding material wealth of a country gave an idea about the possible source of finance to the State. Similarly, data were collected for other purposes also.

On examining the historical records of various ancient countries, one might find that almost all the countries had a system of collection of data. In ancient Egypt, the data on population and material wealth of the country were collected as early as 3050 B.C., for the construction of pyramids. Census was conducted in Jidda in 2030 B.C. and the population was estimated to be 38,00,000. The first census of Rome was done as early as 435 B.C. After the 15th century the work of publishing the statistical data was also started but the first analysis of data on scientific basis was done by Captain John Graunt in the 17th century. His first work on social statistics, 'Observation on London Bills of Mortality' was published in 1662. During the same period the gamblers of western countries had started using statistics, because they wanted to know the more precise estimates of odds at the gambling table. This led to the development of the 'Theory of Probability'.

Although the tradition of collection of data and its use for various purposes is very old, the development of modern statistics as a subject is of recent origin.
The development of the subject took place mainly after the sixteenth century. The notable mathematicians who contributed to the development of statistics are Galileo, Pascal, de Méré, Fermat and Cardano of the 17th century. Then in later years the subject was developed by Abraham De Moivre (1667-1754), Marquis De Laplace (1749-1827), Karl Friedrich Gauss (1777-1855), Adolphe Quetelet (1796-1874), Francis Galton (1822-1911), etc. Karl Pearson (1857-1937), who is regarded as the father of modern statistics, was greatly motivated by the researches of Galton and was the first person to be appointed as Galton Professor in the University of London. William S. Gosset (1876-1937), a student of Karl Pearson, propounded a number of statistical formulae under the pen-name of 'Student'. R.A. Fisher is yet another notable contributor to the field of statistics. His book 'Statistical Methods for Research Workers', published in 1925, marks the beginning of the theory of modern statistics. Among the noteworthy Indian scholars who contributed to statistics are P.C. Mahalanobis, V.K.R.V. Rao, R.C. Desai, P.V. Sukhatme, etc.

Meaning and Definition of Statistics

The meaning of the word 'Statistics' is implied by the pattern of development of the subject. Since the subject originated with the collection of data and then, in later years, the techniques of analysis and interpretation were developed, the word 'statistics' has been used in both the plural and the singular sense. Statistics, in the plural sense, means a set of numerical figures or data. In the singular sense, it represents a method of study and therefore refers to statistical principles and methods developed for analysis and interpretation of data.

Statistics as Data

Statistics used in the plural sense implies a set of numerical figures collected with reference to a certain problem under investigation.
It may be noted here that not every set of numerical figures can be regarded as statistics. There are certain characteristics which must be satisfied by a given set of numerical figures in order that they may be termed as statistics.

Characteristics of Statistics as Data

On the basis of the above definitions we can now state the following characteristics of statistics as data:

1. Statistics are numerical facts: In order that any set of facts can be called statistics or data, it must be capable of being represented numerically or quantitatively. Ordinarily, the facts can be classified into two categories:
(a) Facts that are measurable and can be represented by numerical measurements. Measurement of heights of students in a college, income of persons in a locality, yield of wheat per acre in a certain district, etc., are examples of measurable facts.
(b) Facts that are not measurable but of which we can feel the presence or absence. Honesty, colour of hair or eyes, beauty, intelligence, smoking habit, etc., are examples of immeasurable facts. Statistics or data can be obtained in such cases also, by counting the number of individuals in different categories.

2. Statistics are aggregate of facts: A single numerical figure cannot be regarded as statistics. Similarly, a set of unconnected numerical figures cannot be termed as statistics. Statistics means an aggregate or a set of numerical figures which are related to one another.

3. Statistics are affected to a marked extent by a multiplicity of factors: Statistical data refer to measurement of facts in a complex situation, e.g., business or economic phenomena are very complex in the sense that there are a large number of factors operating simultaneously at a given point of time. Most of these factors are even difficult to identify. We know that the quantity demanded of a commodity, in a given period, depends upon its price, the income of the consumer, the prices of other commodities, and the taste and habits of the consumer. Similarly, the sale of a firm in a given period is affected by a large number of factors. Data collected under such conditions are called statistics or statistical data.

4. Statistics are either enumerated or estimated with reasonable standard of accuracy: This characteristic is related to the collection of data. Data are collected either by counting or by measurement of units or individuals. For example, the number of smokers in a village is counted while the height of soldiers is measured. We may note here that if the area of investigation is large or the cost of measurement is high, the statistics may also be collected by examining only a fraction of the total area of investigation.
When statistics are being obtained by measurement of units, it is necessary to maintain a reasonable degree or standard of accuracy in measurements. The degree of accuracy needed in an investigation depends upon its nature and objective on the one hand and upon time and resources on the other. For example, in weighing of gold, even milligrams may be significant whereas, for weighing wheat, a few grams may not make much difference. Sometimes, a higher degree of accuracy is needed in order that the problem to be investigated gets highlighted by the data. Suppose the diameters of bolts produced by a machine are measured as 1.546 cm, 1.549 cm, 1.548 cm, etc. If, instead, we obtain measurements only up to two places after the decimal, all the measurements would be equal and as such nothing could be inferred about the working of the machine. In addition to this, the degree of accuracy also depends upon the availability of time and resources. For any investigation, a greater degree of accuracy can be achieved by devoting more time or resources or both.

5. Statistics are collected in a systematic manner and for a predetermined purpose: In order that the results obtained from statistics are free from errors, it is necessary that these should be collected in a systematic manner. Haphazardly collected figures are not desirable as they may lead to wrong conclusions. Moreover, statistics should be collected for a well defined and specific objective, otherwise it might happen that unnecessary statistics are collected while the necessary statistics are left out.

6. Statistics should be capable of being placed in relation to each other: This characteristic requires that the collected statistics should be comparable with reference to time or place or any other condition. In order that statistics are comparable it is essential that they are homogeneous and pertain to the same investigation. This can be achieved by collecting data in an identical manner for different periods or for different places or for different conditions.

Hence, any set of numerical facts possessing the above mentioned characteristics can be termed as statistics or data.

Example: Would you regard the following information as statistics? Explain by giving reasons.
(i) The height of a person is 160 cm.
(ii) The height of Ram is 165 cm and of Shyam is 155 cm.
(iii) Ram is taller than Shyam.
(iv) Ram is taller than Shyam by 10 cm.
(v) The height of Ram is 165 cm and the weight of Shyam is 55 kg.

Each of the above statements should be examined with reference to the following conditions:
(a) Whether information is presented as an aggregate of numerical figures
(b) Whether numerical figures are homogeneous or comparable
(c) Whether numerical figures are affected by a multiplicity of factors

On examination of the given information in the light of these conditions we find that only the information given by statement (ii) can be regarded as statistics.
Changes in the structure of human organisation, perfection in various fields and the introduction of decision-making have given birth to quantitative techniques. The application of quantitative techniques helps in making decisions in such complicated situations. Evidently, the primary objective of quantitative techniques is to study the different components of an organisation by employing the methods of mathematical statistics, in order to understand the behaviour of the system and gain a greater degree of control over it. In short, the objective of quantitative techniques is to make a scientific basis available to the decision-maker for solving problems involving the interaction of different components of the organisation, by employing a team of scientists from distinct disciplines, all working in concert to find a solution which is in the best interest of the organisation as a whole. The best solution thus obtained is known as the optimal decision.

1.2 STATISTICAL SERIES

The classified data, when arranged in some logical order, e.g., according to size, according to the time of occurrence or according to some other measurable or non-measurable characteristic, is known as a Statistical Series. H. Secrist defined a statistical series as, "A series, as used statistically, may be defined as things or attributes of things arranged according to some logical order." Another definition, given by L. R. Connor, is, "If the two variable quantities can be arranged side by side so that the measurable differences in the one correspond to the measurable differences in the other, the result is said to form a statistical series."

A statistical series can be one of the following four types: (i) Spatial Series, (ii) Conditional Series, (iii) Time Series and (iv) Qualitative or Quantitative Series.

The series formed by the geographical or spatial classification is termed as spatial series.
Similarly, a series formed by the conditional classification is known as the conditional series. The examples of such series are already given under their respective classification category.

Time Series

A time series is the result of chronological classification of data. In this case, various figures are arranged with reference to the time of their occurrence. For example, the data on exports of India in various years is a time series.

Year                  1980   1981   1982   1983   1984   1985    1986    1987    1988
Exports (in Rs cr.)   6591   7242   8309   8810   9981   10427   11490   15741   20295

Qualitative or Quantitative Series

This type of series is obtained when the classification of data is done on the basis of qualitative or quantitative characteristics. Accordingly, we can have two types of series, namely, qualitative and quantitative series.

(a) Qualitative Series: In case of qualitative series, the number of items in each group is shown against that group. These groups are either expressed in ascending order or in descending order of the number of items in each group. The example of such a series is given below.

Distribution of Students of a College according to Sex

Sex               Males   Females   Total
No. of Students   1700    500       2200

(b) Quantitative Series: In case of quantitative series, the number of items possessing a particular value is shown against that value. A quantitative series can be of two types: I. Individual Series, and II. Frequency Distribution.

I. Individual Series: In an individual series, the names of the individuals are written against their corresponding values. For example, the list of employees of a firm and their respective salary in a particular month.

II. Frequency Distribution: A table in which the frequencies and the associated values of a variable are written side by side, is known as a frequency distribution.
According to Croxton and Cowden, "Frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in a group with their corresponding frequencies side by side." A frequency distribution can be discrete or continuous depending upon whether the variable is discrete or continuous.

1.3 CONSTRUCTION OF A FREQUENCY DISTRIBUTION

Construction of a Discrete Frequency Distribution

A discrete frequency distribution may be ungrouped or grouped. In an ungrouped frequency distribution, various values of the variable are shown along with their corresponding frequencies. If this distribution fails to reveal any pattern, grouping of various observations becomes necessary. The resulting distribution is known as a grouped frequency distribution of a discrete variable. Furthermore, a grouped frequency distribution is also constructed when the number of possible values that a variable can take is large.

Ungrouped Frequency Distribution of a Discrete Variable

Suppose that a survey of 150 houses was conducted and the number of rooms in each house was recorded as shown below:

5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3 1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6 3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4 2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3 6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2 1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6 4 6 2 2 4 5 6 3 6 4 1 2 4 2 2 3 4 5

Counting of Frequency using Tally Marks

The method of tally marks is used to count the number of observations or the frequency of each value of the variable. Each possible value of the variable is written in a column. For every observation, a tally mark denoted by '|' is noted against its corresponding value. Five observations are denoted by four marks crossed by a fifth, i.e., the fifth tally mark crosses the earlier four marks, and so on. The method of tally marks is used below to determine the frequencies of various values of the variable for the data given above.
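The tally procedure can also be sketched in a few lines of Python (an illustration added here, not part of the original study material). For brevity the sketch uses only the first twenty observations of the survey data; the variable names are illustrative, and `Counter` stands in for the column of tally marks.

```python
from collections import Counter

# First twenty observations from the survey data above (rooms per house).
rooms = [5, 4, 4, 6, 3, 2, 2, 6, 6, 2, 6, 3, 3, 4, 5, 6, 3, 2, 2, 5]

# Counter plays the role of the tally column: one count per distinct value.
frequency = Counter(rooms)

# Print value, tally marks and frequency for each possible value 1..6.
for value in range(1, 7):
    count = frequency[value]  # a value that never occurs yields 0
    print(f"{value}  {'|' * count:<6} {count}")
```

Running the same code on all 150 observations would give the full ungrouped frequency distribution.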
In the above frequency distribution, the number of rooms 'X' is a discrete variable which can take integral values from 1 to 6. This distribution is also known as ungrouped frequency distribution. It should be noted here that, in case of ungrouped frequency distribution, the identity of various observations is not lost, i.e., it is possible to get back the original observations from the given frequency distribution.

Grouped Frequency Distribution of a Discrete Variable

Consider the data on marks obtained by 50 students in statistics. The variable 'X' denoting marks obtained is a discrete variable. Let the ungrouped frequency distribution of this data be as given in the following table.

Marks   Frequency     Marks   Frequency     Marks   Frequency
33      1             57      1             76      1
35      2             59      1             77      2
39      1             60      2             78      1
41      2             61      1             80      1
42      1             64      1             81      1
45      1             65      3             84      1
48      2             66      2             85      2
50      1             67      1             88      1
52      1             69      2             89      1
53      1             71      1             91      1
54      1             73      2             94      2
55      2             74      2             98      1

This frequency distribution does not reveal any pattern of behaviour of the variable. In order to bring the behaviour of the variable into focus, it becomes necessary to convert this into a grouped frequency distribution. If, instead, the individual marks are grouped into classes such as marks between and including 30 and 39, 40 and 49, etc., and the respective frequencies are written against them, we get a grouped frequency distribution as shown below:

Marks between and including   Frequency
30 - 39                       4
40 - 49                       6
50 - 59                       8
60 - 69                       12
70 - 79                       9
80 - 89                       7
90 - 99                       4
Total                         50

The above frequency distribution is more revealing than the earlier one. It is easy to understand the behaviour of marks on the basis of this distribution. It should, however, be pointed out here that the identity of observations is lost after grouping. For example, on the basis of the above distribution we can only say that 4 students have obtained marks between and including 30-39, etc.
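The grouping step can be sketched as follows (Python, added for illustration; the dictionary simply restates the ungrouped table of marks, and the names are illustrative). Each mark is assigned to the inclusive class 30-39, 40-49, and so on, and the frequencies within a class are added.

```python
from collections import Counter

# Ungrouped distribution from the table of marks: mark -> frequency.
ungrouped = {
    33: 1, 35: 2, 39: 1, 41: 2, 42: 1, 45: 1, 48: 2, 50: 1, 52: 1, 53: 1, 54: 1, 55: 2,
    57: 1, 59: 1, 60: 2, 61: 1, 64: 1, 65: 3, 66: 2, 67: 1, 69: 2, 71: 1, 73: 2, 74: 2,
    76: 1, 77: 2, 78: 1, 80: 1, 81: 1, 84: 1, 85: 2, 88: 1, 89: 1, 91: 1, 94: 2, 98: 1,
}

grouped = Counter()
for mark, freq in ungrouped.items():
    lower = (mark // 10) * 10            # 33 -> 30, 48 -> 40, ...
    grouped[(lower, lower + 9)] += freq  # inclusive class, e.g. 30 - 39

for (lower, upper), freq in sorted(grouped.items()):
    print(f"{lower} - {upper}: {freq}")
print("Total:", sum(grouped.values()))
```

The grouped counts record only how many marks fall in each class, not which marks they were.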
Thus, it is not possible to get back the original observations from a grouped frequency distribution.

Construction of a Continuous Frequency Distribution

As opposed to a discrete variable, a continuous variable can take any value in an interval. Measurements like height, age, income, time, etc., are some examples of a continuous variable. As mentioned earlier, when data are collected regarding these variables, they will show discreteness, which depends upon the degree of precision of the measuring instrument. Therefore, in such a situation, even if the recorded data appear to be discrete, they should be treated as continuous. Since a continuous variable can take any value in a given interval, the frequency distribution of a continuous variable is always a grouped frequency distribution.

To construct a grouped frequency distribution, the whole interval of the continuous variable, given by the difference of its largest and smallest possible values, is divided into various mutually exclusive and exhaustive sub-intervals. These sub-intervals are termed as class intervals. Then, the frequency of each class interval is determined by counting the number of observations falling under it. The construction of such a distribution is explained below:

The figures, given below, are the 90 measurements of diameter (in mm.) of a wire.

1.86, 1.58, 1.13, 1.46, 1.53, 1.65, 1.49, 1.03, 1.10, 1.36, 1.37, 1.46, 1.44, 1.46, 1.95, 1.67, 1.59, 1.35, 1.44, 1.40, 1.50, 1.41, 1.19, 1.16, 1.27, 1.21, 1.82, 1.55, 1.52, 1.42, 1.17, 1.62, 1.42, 1.22, 1.56, 1.78, 1.98, 1.31, 1.29, 1.69, 1.32, 1.68, 1.36, 1.55, 1.54, 1.67, 1.81, 1.47, 1.30, 1.33, 1.38, 1.34, 1.40, 1.37, 1.27, 1.04, 1.87, 1.45, 1.47, 1.35, 1.24, 1.48, 1.41, 1.39, 1.38, 1.47, 1.73, 1.20, 1.77, 1.25, 1.62, 1.43, 1.51, 1.60, 1.15, 1.26, 1.76, 1.66, 1.12, 1.70, 1.57, 1.75, 1.28, 1.56, 1.42, 1.09, 1.07, 1.57, 1.92, 1.48.
The following decisions are required to be taken in the construction of any frequency distribution of a continuous variable.

1. Number of Class Intervals: Though there is no hard and fast rule regarding the number of classes to be formed, their number should be neither very large nor very small. If there are too many classes, the frequency distribution appears too fragmented to reveal the pattern of behaviour of the characteristic. Fewer classes imply that the width of the class intervals will be broad and accordingly each would include a large number of observations. As will become clear later, in any statistical analysis the value of a class is represented by its mid-value and hence a class interval with broader width will be representative of a large number of observations. Thus, the magnitude of loss of information due to grouping will be large when there is a small number of classes. On the other hand, if the number of observations is small or the distribution of observations is irregular, i.e., not uniform, having a larger number of classes might result in zero or very small frequencies of some classes, thus revealing no pattern of behaviour. Therefore, the number of classes depends upon the nature and the number of observations. If the number of observations is large or the distribution of observations is regular, one may have more classes. In practice, the minimum number of classes should not be less than 5 or 6 and in any case there should not be more than 20 classes.
The approximate number of classes can also be determined by Sturges' formula: $n = 1 + 3.322 \log_{10} N$, where $n$ (rounded to the next whole number) denotes the number of classes and $N$ denotes the total number of observations.
For the given data on the measurement of diameter, there are 90 observations. The number of classes by Sturges' formula is $n = 1 + 3.322 \log_{10} 90 = 7.492$, or 8.

2. Width of a Class Interval: After determining the number of class intervals, one has to determine their width. The problem of determining the width of a class interval is closely related to the number of class intervals.
As far as possible, all the class intervals should be of equal width. However, there can be situations where it may not be possible to have equal width of all the classes. Suppose that there is a frequency distribution, having all classes of equal width, in which the pattern of behaviour of the observations is not regular, i.e., there are nil or very few observations in some classes while there is concentration of observations in other classes. In such a situation, one may be compelled to have unequal class intervals in order that the frequency distribution becomes regular.
The approximate size of a class interval can be decided by the use of the following formula:

$$\text{Class Interval} = \frac{\text{Largest observation} - \text{Smallest observation}}{\text{Number of class intervals}}$$

or, using notation, $\text{Class Interval} = \dfrac{L - S}{n}$.

In the example given above, $L = 1.98$, $S = 1.03$ and $n = 8$.

$$\text{Approximate size of a class interval} = \frac{1.98 - 1.03}{8} = 0.1188 \text{ or } 0.12 \text{ (approx.)}$$

Before taking a final decision on the width of various class intervals, it is worthwhile to consider the following points:
(a) Normally a class interval should be a multiple of 5, because it is easy to grasp numbers like 5, 10, 15, etc.
(b) It should be convenient to find the mid-value of a class interval.
(c) Most of the observations in a class should be uniformly distributed or concentrated around its mid-value.
(d) As far as possible, all the classes should be of equal width. A frequency distribution of equal class width is convenient to represent diagrammatically and easy to analyse.
On the basis of the above considerations, it will be more appropriate to have classes each with an interval of 0.10 rather than 0.12. Further, the number of classes should also be revised in the light of this decision:

$$n = \frac{L - S}{\text{Class Interval}} = \frac{1.98 - 1.03}{0.10} = \frac{0.95}{0.10} = 9.5 \text{ or } 10 \text{ (rounded to the next whole number)}$$

3. Designation of Class Limits: The class limits are the smallest and the largest observation in a class. These are respectively known as the lower limit and the upper limit of a class. For a frequency distribution, it is necessary to designate these class limits very unambiguously, because the mid-value of a class is obtained by using these limits. As will become clear later, this mid-value will be used in all the computations about a frequency distribution and the accuracy of these computations will depend upon the proper specification of class limits. The class limits should be designated keeping the following points in mind:
(a) It is not necessary to have the lower limit of the first class exactly equal to the smallest observation of the data. In fact it can be less than or equal to the smallest observation. Similarly, the upper limit of the last class can be equal to or greater than the largest observation of the data.
(b) It is convenient to have the lower limit of a class either equal to zero or some multiple of 5.
(c) The chosen class limits should be such that the observations in a class tend to concentrate around its mid-value. This will be true if the observations are uniformly distributed in a class.
The designation of class limits for various class intervals can be done in two ways: (i) Exclusive Method and (ii) Inclusive Method.
(i) Exclusive Method: In this method the upper limit of a class is taken to be equal to the lower limit of the following class. To keep various class intervals mutually exclusive, the observations with magnitude greater than or equal to the lower limit but less than the upper limit of a class are included in it.
For example, if the lower limit of a class is 10 and its upper limit is 20, then this class, written as 10-20, includes all the observations which are greater than or equal to 10 but less than 20. The observations with magnitude 20 will be included in the next class.
(ii) Inclusive Method: Here all observations with magnitude greater than or equal to the lower limit and less than or equal to the upper limit of a class are included in it.

The two types of class intervals, discussed above, are constructed for the data on the measurements of diameter of a wire as shown below:

Class Intervals   20 - 29   30 - 39   40 - 49   50 - 59   Total
Frequency         8         15        10        7         40

Mid-Value of a Class

In exclusive types of class intervals, the mid-value of a class is defined as the arithmetic mean (to be discussed later) of its lower and upper limits. However, in the case of inclusive types of class intervals, there is a gap between the upper limit of a class and the lower limit of the following class, which is eliminated by determining the class boundaries. Here, the mid-value of a class is defined as the arithmetic mean of its lower and upper boundaries.

To find class boundaries, we note that the given data on the measurements of diameter of a wire is expressed in millimetres, approximated up to two places after the decimal. This implies that a value greater than or equal to 1.095 but less than 1.10 is approximated as 1.10 and, thus, included in the class interval 1.10-1.19. Similarly, an observation less than 1.095 but greater than 1.09 is approximated as 1.09 and is included in the interval 1.00-1.09. Keeping the precision of measurements in mind, various class boundaries for the inclusive class intervals given above can be obtained by subtracting 0.005 from the lower limit and adding 0.005 to the upper limit of each class. These boundaries are given in the third column of the above table.
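The three decisions above (number of classes, class width, class limits and boundaries) can be traced numerically. The following Python sketch is an illustration added to the text, not the authors' own procedure; it uses the values N = 90, L = 1.98 and S = 1.03 from the wire-diameter data, and the variable names are illustrative. Class limits are handled in hundredths of a millimetre to avoid floating-point drift.

```python
import math

N = 90              # number of observations
L, S = 1.98, 1.03   # largest and smallest observation (mm)

# Sturges' formula, rounded up to the next whole number: 7.492 -> 8.
n = math.ceil(1 + 3.322 * math.log10(N))
width = (L - S) / n                      # roughly 0.12 mm

# Revised choice from the text: a more convenient width of 0.10.
chosen_width = 0.10
n_revised = math.ceil(round((L - S) / chosen_width, 6))   # 9.5 -> 10

# Inclusive classes 1.00-1.09, 1.10-1.19, ... with boundaries and mid-values.
classes = []
for k in range(n_revised):
    lower = 100 + 10 * k                 # lower limit in hundredths (1.00 mm -> 100)
    upper = lower + 9                    # upper limit in hundredths
    lower_boundary = (lower - 0.5) / 100   # lower limit - 0.005
    upper_boundary = (upper + 0.5) / 100   # upper limit + 0.005
    mid_value = (lower_boundary + upper_boundary) / 2
    classes.append((lower / 100, upper / 100, lower_boundary, upper_boundary, mid_value))

print(f"Sturges: n = {n}, width = {width:.2f}; revised n = {n_revised}")
for lo, hi, lb, ub, mid in classes:
    print(f"{lo:.2f}-{hi:.2f}   boundaries {lb:.3f}-{ub:.3f}   mid-value {mid:.3f}")
```

Note that the upper boundary of one class coincides with the lower boundary of the next, which is how the gap between inclusive classes is eliminated.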
Construction of a Grouped Frequency Distribution for the Data on the Measurements of Diameter of a Wire

Taking class intervals as 1.00-1.10, 1.10-1.20, etc. and counting their respective frequencies, by the method of tally marks, we get the required frequency distribution as given below:

Example 1

Given below are the weights (in pounds) of 70 students.
(i) Construct a frequency distribution when class intervals are inclusive, taking the lowest class as 60-69. Also construct class boundaries.
(ii) Construct a frequency distribution when class intervals are exclusive, taking the lowest class as 60-70.

61, 80, 91, 113, 100, 106, 109, 73, 88, 92, 101, 106, 107, 97, 93, 96, 102, 114, 87, 62, 74, 107, 109, 91, 72, 89, 94, 98, 112, 103, 101, 77, 92, 73, 67, 76, 84, 90, 118, 107, 108, 82, 78, 84, 77, 95, 111, 115, 104, 69, 106, 105, 63, 76, 85, 88, 96, 90, 95, 99, 83, 98, 88, 72, 75, 86, 82, 86, 93, 92.

Solution:
(i) Construction of frequency distribution using inclusive class intervals.

To determine the class boundaries, we note that measured weights are approximated to the nearest pound. Therefore, a measurement less than 69.5 is approximated as 69 and included in the class interval 60-69. Similarly, a measurement greater than or equal to 69.5 is approximated as 70 and is included in the class interval 70-79. Thus, the class boundaries are obtained by subtracting 0.5 from the lower limit and adding 0.5 to the upper limit of various classes. These boundaries are shown in the last column of the above table.
(ii) The frequency distribution of exclusive type of class intervals can be directly written from the above table as shown below:

Class Intervals   Frequency
60 - 70            5
70 - 80           11
80 - 90           14
90 - 100          18
100 - 110         16
110 - 120          6
Total             70

Relative or Percentage Frequency Distribution

If instead of frequencies of various classes their relative or percentage frequencies are written, we get a relative or percentage frequency distribution.

$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total frequency}}$$

$$\text{Percentage frequency of a class} = \text{Relative frequency} \times 100$$

These frequencies are shown in the following table.

Cumulative Frequency Distribution

In order to answer questions such as the number of measurements of diameter that are less than 1.70, or the number of measurements that are greater than 1.30, a cumulative frequency distribution is constructed. A cumulative frequency distribution can be of two types:

(i) Less than type cumulative frequency distribution
(ii) More than type cumulative frequency distribution

These frequency distributions, for the data on the measurements of diameter of a wire, are shown in Table I and Table II respectively.

Table I                                Table II
Diameters         Cumulative           Diameters         Cumulative
                  Frequency                              Frequency
less than 1.10     4                   More than 1.00    90
less than 1.20    11                   More than 1.10    86
less than 1.30    21                   More than 1.20    79
less than 1.40    35                   More than 1.30    69
less than 1.50    55                   More than 1.40    55
less than 1.60    68                   More than 1.50    35
less than 1.70    77                   More than 1.60    22
less than 1.80    83                   More than 1.70    13
less than 1.90    87                   More than 1.80     7
less than 2.00    90                   More than 1.90     3

Frequency Density

Frequency density in a class is defined as the number of observations per unit of its width.
Frequency density gives the rate of concentration of observations in a class:

$$\text{Frequency density} = \frac{\text{Frequency of the class}}{\text{Width of the class}}$$

The frequency densities of various classes are shown in the following table:

Class Intervals   Frequency   Frequency Density
1.00 - 1.10        4           40
1.10 - 1.20        7           70
1.20 - 1.30       10          100
1.30 - 1.40       14          140
1.40 - 1.50       20          200
1.50 - 1.60       13          130
1.60 - 1.70        9           90
1.70 - 1.80        6           60
1.80 - 1.90        4           40
1.90 - 2.00        3           30
Total             90

1.4 BIVARIATE AND MULTIVARIATE FREQUENCY DISTRIBUTIONS

Bivariate Frequency Distributions

In the frequency distributions discussed so far, the data are classified according to only one characteristic. These distributions are known as univariate frequency distributions. There may be a situation where it is necessary to classify data, simultaneously, according to two characteristics. A frequency distribution obtained by the simultaneous classification of data according to two characteristics is known as a bivariate frequency distribution. An example of such a classification is given below, where 100 couples are classified according to the two characteristics, Age of Husband and Age of Wife. The tabular representation of the bivariate frequency distribution is known as a contingency table.

Classification according to Age of Husband and Age of Wife in a sample of 100 couples

Age of Husband    Age of Wife
                  20-30   30-40   40-50   50-60   Total
20-30             26       0       0       0       26
30-40             20      15       0       0       35
40-50              5      17      10       0       32
50-60              0       0       6       1        7
Total             51      32      16       1      100

It should be noted that in a bivariate classification either or both of the variables can be discrete or continuous. Further, there may be a situation in which one characteristic is a variable and the other is an attribute.

Multivariate Frequency Distribution

If the classification is done, simultaneously, according to more than two characteristics, the resulting frequency distribution is known as a multivariate frequency distribution.
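The row and column totals of the contingency table for the 100 couples can be checked with a short Python sketch. It is illustrative only: the variable names are introduced here, and the layout assumes that rows correspond to Age of Husband and columns to Age of Wife. Each row total or column total is the univariate (marginal) frequency of one characteristic.

```python
age_classes = ["20-30", "30-40", "40-50", "50-60"]

# Joint frequencies of the 100 couples.
# Assumption: rows are Age of Husband, columns are Age of Wife.
joint = [
    [26, 0, 0, 0],
    [20, 15, 0, 0],
    [5, 17, 10, 0],
    [0, 0, 6, 1],
]

# Marginal (univariate) frequencies are obtained by summing
# across each row and down each column.
row_totals = [sum(row) for row in joint]
col_totals = [sum(col) for col in zip(*joint)]
grand_total = sum(row_totals)

for label, total in zip(age_classes, row_totals):
    print(f"row class {label}: {total}")
for label, total in zip(age_classes, col_totals):
    print(f"column class {label}: {total}")
print("total couples:", grand_total)
```

Both sets of marginal totals must add up to the same grand total, which gives a quick check on the entries of the table.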
Example 2

Find the lower and upper limits of the classes when their mid-values are given as 15, 25, 35, 45, 55, 65, 75, 85 and 95.

Solution: Note that the difference between two successive mid-values is the same, i.e., 10. Half of this difference is subtracted from and added to the mid-value of a class in order to get its lower limit and upper limit respectively. Hence, the required class intervals are 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100.

1.5 GRAPHIC PRESENTATION OF DATA

Graphic presentation is another way of pictorial presentation of the data. Graphs are commonly used for presentation of time series and frequency distributions. In situations where diagrams as well as graphs can be used, the latter are preferred because of their advantages over the former. Graphic presentation of data, like diagrammatic presentation, provides a quick and easy way of understanding various trends of data and facilitates the comparison of two or more situations. In addition to this, it can also be used as a tool of analysis. Graphic methods are sometimes used in place of mathematical computations to save time and labour, e.g., free hand curves may be fitted in place of mathematical curves to determine trend values.

Advantages of Graphic Presentation

A properly constructed graph may provide more information than tabular or diagrammatically presented data. Graphic presentation may indicate the nature of the trend present and the manner in which it is likely to change in future. Various advantages of a graphic presentation are:

(i) Graphs provide an attractive and lasting effect.
(ii) Graphs are easy to understand.
(iii) Graphs provide easy comparison of two or more phenomena.
(iv) Graphs provide a method of locating certain positional averages like median, mode, quartiles, etc. They can also be used to study correlation between two variables.
(v) Graphs also facilitate the process of interpolation, extrapolation and forecasting.
(vi) It saves time and energy of the statistician as well as of the observer.
(vii) It can indicate the nature and the direction of trend of the data.

Construction of a Graph

A point in a plane can be located with reference to two mutually perpendicular lines. The horizontal line is called the X-axis and the vertical line the Y-axis. Their point of intersection is termed the origin. The position of a point in a plane is located by its distance from the two axes. If a point P is 4 units away from Y-axis and 3 units away from X-axis, its location will be as shown in the figure.

Figure 1.1

It should be noted here that the distance of the point from Y-axis is measured along X-axis and its distance from X-axis is measured along Y-axis. To measure 4 units from Y-axis, one moves 4 units along X-axis and erects a perpendicular at this point. Similarly, to measure 3 units from X-axis, one moves 3 units along Y-axis and erects a perpendicular. The point of intersection of these two perpendiculars will be the required point.

The position of a point is denoted by a pair of numbers, e.g., (4, 3) for the point P, that are respectively termed the abscissa and ordinate of the point. Jointly they are termed the coordinates of a point. The coordinates of a point, in general form, are written as (x, y).

The four parts of the plane are called quadrants, as shown in the above figure. It may be noted that both x and y are positive in the first quadrant, x is negative and y is positive in the second, x and y are both negative in the third, and x is positive and y is negative in the fourth quadrant.

Different points can be plotted for different pairs of values, e.g., for data on demand of a commodity at different prices, we can locate a point for each pair of quantity and price combination. These points are then joined by a curve or a line to get the required graph.
General Rules for the Graphic Presentation

(i) Every graph must have a suitable title written at its top. This title should indicate the facts presented by the graph in a comprehensive and unambiguous manner.
(ii) By convention, the independent variable is normally measured along X-axis and the dependent variable on Y-axis. The scale on Y-axis must always start from zero. If the fluctuations are small as compared to the size of the variable, there is no need to show the entire vertical axis from origin. This can be done by showing a gap in the vertical axis and drawing a horizontal line from it. This line is often termed a false base line.
(iii) The choice of a scale of measurement should be such that the whole data can be accommodated in the available space and all of its important fluctuations are clearly depicted.
(iv) Proportional changes in the values of the variables can be shown by drawing a ratio or logarithmic scale.
(v) A graph must not be overcrowded with curves.
(vi) When more than one curve is to be shown on the same graph, it is necessary to distinguish them by drawing curves of different pattern or colour.
(vii) An index should always be given to show the scales and the interpretations of different curves.
(viii) The source of data should be mentioned as a footnote.

Difference between a Diagram and a Graph

A brief distinction between a diagram and a graph is given below.

Diagram                                          Graph
1. Can be drawn on an ordinary paper.            1. Can be drawn on a graph paper.
2. Easy to grasp.                                2. Needs some effort to grasp.
3. Not capable of analytical treatment.          3. Capable of analytical treatment.
4. Can be used only for comparisons.             4. Can be used to represent a mathematical relation.
5. Data are represented by bars, rectangles,     5. Data are represented by lines and curves.
   pictures, etc.

A graphic presentation is used to represent two types of statistical data: (i) Time Series Data and (ii) Frequency Distribution.
1.6 TIME SERIES GRAPHS OR HISTORIGRAMS

A time series is a series of values of a variable recorded at successive intervals of time; e.g., the figures of national income, sale, production, employment, enrolment, etc., at successive points of time. In case of a time series graph, the time is measured along X-axis and the other variable along Y-axis. Such a graph is also termed a historigram. If the actual values of a series are plotted, the resulting graph is called an Absolute Historigram, while if their indices are plotted, we get an Index Historigram. Various types of time series graphs or historigrams are:

(i) Line Graph
(ii) Range or Variation Graph
(iii) Component or Band Graph
(iv) Net balance or Silhouette Graph
(v) Zee Chart or Z-curve

1.7 LOGARITHMIC GRAPHS OR RATIO CHARTS

The scales used in the graphs discussed so far were arithmetic or natural or absolute, where equal distances were used to show equal differences in absolute magnitudes. For example, if the assumed scale is 1 cm = 10 units, then the gap between the values 200 and 210 (or 250 and 260, etc.) will be 1 cm and the gap between the values 240 and 280 will be 4 cm, when plotted on a line. This is shown below:

If instead of plotting the absolute values on a graph, their logarithms are plotted, the scale is known as logarithmic or ratio scale. In such a scale equal ratios of values are represented by equal distances. In order to understand this, let us plot the logarithms of the values 200, 210, 220, 230, 240 and 250.

Values          200      210      220      230      240      250
Log of values   2.3010   2.3222   2.3424   2.3617   2.3802   2.3979

By selecting a suitable scale of log units, the above values are plotted on the line as shown below:

Here the distance between log 210 and log 200 = 2.3222 - 2.3010 = 0.0212 mm. Also, antilog 0.0212 = 1.05, which is equal to the ratio 210/200. Similarly, the distance between log 240 and log 210 = 2.3802 - 2.3222 = 0.0580 mm.
Further, antilog 0.0580 = 1.1428, which is equal to the ratio 240/210. Hence, the antilog of the distance between logarithms of two numbers represents their ratio.

As we know, there are two axes in a graphic presentation. When logarithms of values of both the variables are plotted, it is called a logarithmic graph. Similarly, when one variable is plotted on a logarithmic scale against the other in natural units, the graph is termed a semi-logarithmic graph or a ratio chart.

Logarithmic Graph

As mentioned above, a logarithmic graph is obtained when both the axes are measured in logarithmic units. In such a graph, proportional changes in one variable are plotted against the proportional changes in the other. Thus, the slope of the curve, at any point, measures the elasticity of the variable taken on the vertical axis with respect to the variable taken on the horizontal axis.

Semi-logarithmic Graph

In a semi-logarithmic graph, the proportional changes in one variable are plotted against the absolute change in the other. Such a graph is very useful for the interpretation of time series data when it is desired to study the pattern of rate of growth of the variable rather than its absolute rate of change. For example, the manager of a firm may be interested in knowing the rate of growth or fall of national income, level of employment, prices, etc. The rate of growth at a particular point of time is given by the slope of the time series curve plotted on a semi-logarithmic graph.

Many a time, when the range of values of observations is very large, it may be difficult to represent such observations using an arithmetic scale. Considerable economy of space can be achieved by the use of a logarithmic scale.

Semi-Logarithmic Graph vs. Natural Graph

In order to understand the basic difference between the two types of graphs, consider the following two time series.
Years                         1985   1986   1987   1988    1989    1990
Profits of Firm A (Rs '000)   100    120    140    160     180     200
Profits of Firm B (Rs '000)   100    120    144    172.8   207.4   248.9

It should be noted that the profits of firm A are increasing by Rs 20,000 each year while the profits of firm B are increasing by 20% each year. These two time series are plotted on the same graph using natural scale, as shown below.

Figure 1.2

We note that the graph of firm A is a straight line while the graph of firm B is a curve. However, if the profits of firm B are plotted on a semi-logarithmic graph, we get a straight line, as shown below.

Year                1985     1986     1987     1988     1989     1990
Profits (Rs '000)   100      120      144      172.8    207.4    248.9
Logarithms          2.0000   2.0792   2.1584   2.2375   2.3168   2.3960

Figure 1.3

Note:

1. It is obvious from Figure 1.2 that if the difference of successive values is not constant, the corresponding natural-scale graph is a curve. Similarly, when ratios of successive values are not the same, the semi-logarithmic graph will be a curve.
2. The vertical axis starts from zero on a natural scale, whereas any positive number can be taken as the starting point on a logarithmic scale. Negative values cannot be plotted on a logarithmic scale.
3. Since the starting point of a semi-logarithmic graph can be any positive value, there is no need of having a false base line.

Construction of Semi-logarithmic Graphs

A semi-logarithmic graph can be drawn in either of the following two ways:

(i) By taking logarithms of the values and plotting these values on a natural scale.
(ii) By plotting the given values on a semi-logarithmic graph paper. Such graph papers are available from the market. Use of such graph papers avoids the bother of taking logarithms of various numbers.
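The contrast between the two firms can be verified numerically with a short Python sketch. It is illustrative only; the helper name `log_differences` is introduced here. A series plots as a straight line on a semi-logarithmic graph when successive differences of its logarithms are constant, i.e., when the ratio of successive values is constant.

```python
import math

profits_a = [100, 120, 140, 160, 180, 200]        # Firm A, Rs '000
profits_b = [100, 120, 144, 172.8, 207.4, 248.9]  # Firm B, Rs '000


def log_differences(values):
    """Differences between common logarithms of successive values."""
    logs = [math.log10(v) for v in values]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


diff_a = log_differences(profits_a)
diff_b = log_differences(profits_b)

# Firm B: every difference is close to log10(1.2), i.e. a constant 20% growth,
# so its semi-logarithmic graph is (apart from rounding) a straight line.
print([round(d, 4) for d in diff_b])

# Firm A: the differences shrink every year, so equal absolute increases
# of Rs 20,000 mean a falling rate of growth and a curved semi-log graph.
print([round(d, 4) for d in diff_a])
```

Small departures from a constant difference for firm B arise only because its profits are rounded to one decimal place.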
In a semi-logarithmic graph paper, the scale along the horizontal axis is natural, i.e., equal distances denote equal absolute differences in values, and the scale along the vertical axis is logarithmic, i.e., equal distances denote equal ratios of values. The specimen of such a graph paper is given below.

Figure 1.4: Semi-logarithmic Graph Paper

It should be noted here that the distances between the values having the same ratios are equal. For example, the distances between 1 and 2, 2 and 4, 3 and 6, 4 and 8, 5 and 10, 10 and 20, 20 and 40, etc., are equal as the ratio of each of the pairs of numbers is 2. Further, the distance between 1 and 3 is equal to the distance between 2 and 6 as the ratio 3/1 = 6/2.

Example 3

The following table gives the sales (in Rs lacs) of two firms in different months of 1991. Represent the data by semi-logarithmic graphs.

Months   Jan   Feb   Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Firm A   100   105   115.5   138.6   159.4   175.3   166.6   149.9   127.4   101.9   91.7    87.2
Firm B   120   126   138.6   166.3   191.3   210.4   199.9   179.9   152.9   122.3   110.1   104.6

Solution: The above data will be plotted on a semi-logarithmic graph.

Months              Jan      Feb      Mar      Apr      May      Jun
log of Sales of A   2.0000   2.0212   2.0626   2.1418   2.2025   2.2438
log of Sales of B   2.0792   2.1004   2.1418   2.2209   2.2817   2.3230

Months              Jul      Aug      Sep      Oct      Nov      Dec
log of Sales of A   2.2217   2.1758   2.1052   2.0082   1.9624   1.9405
log of Sales of B   2.3008   2.2550   2.1844   2.0874   2.0418   2.0195

Figure 1.5

The above graph reveals the following points:

(i) Since the graph of sales is a curve for each firm, the rates of change of sales are different in different months. Further, the sales have shown a rise up to June and a decline in the later months.
(ii) For both the firms, the segment AB of the curve is rising and convex from below, which implies that the sales are increasing at increasing rate.
(iii) For both the firms, the segment BC of the curve is rising and concave from below, which implies that the sales are increasing at decreasing rate.
(iv) For both the firms, the segment CD of the curve is falling and concave from below, which implies that the sales are decreasing at decreasing rate.
(v) For both the firms, the segment DE of the curve is falling and convex from below, which implies that the sales are decreasing at increasing rate.
(vi) The vertical gap between the curves of the two firms is constant (or equivalently, their slopes are equal) in each month, which implies that the rate of increase or decrease of sales in a given month is approximately the same for both the firms.

Limitations of a Semi-logarithmic Graph

1. Since logarithms of negative numbers are not defined, a semi-logarithmic graph cannot be used to represent negative values. Accordingly, a 'Net Balance Graph' cannot be drawn on semi-logarithmic scale.
2. Interpretation of a semi-logarithmic graph by a layman is very difficult.
3. Such a graph cannot be used to study various components of an aggregate of a value.

1.8 GRAPH OF A FREQUENCY DISTRIBUTION

A frequency distribution can also be represented by means of a graph. The most common forms of graphs of a frequency distribution are:

1. Histogram
2. Frequency Polygon
3. Frequency Curve
4. 'Ogive' or Cumulative Frequency Curve

1. Histogram: A histogram is a graph of a frequency distribution in which the class intervals are plotted on X-axis and their respective frequencies on Y-axis. On each class, a rectangle is erected with its height proportional to the frequency density of the class.

(a) Construction of a Histogram when Class Intervals are equal: In this case the height of each rectangle is taken to be equal to the frequency of the corresponding class. The construction of such a histogram is illustrated by the following example.

Example 4

The frequency distribution of marks obtained by 60 students of a class in a college is given below:

Marks             30 - 34   35 - 39   40 - 44   45 - 49   50 - 54   55 - 59   60 - 64
No. of Students   3         5         12        18        14        6         2

Draw a histogram for the distribution.

Solution: Since the upper limit of a class is not equal to the lower limit of its following class, the class boundaries will have to be determined. The distribution after adjustment will be as given below.

Marks         No. of Students
29.5 - 34.5    3
34.5 - 39.5    5
39.5 - 44.5   12
44.5 - 49.5   18
49.5 - 54.5   14
54.5 - 59.5    6
59.5 - 64.5    2

Figure 1.6: Histogram

(b) Construction of a Histogram when Class Intervals are not equal: When different classes of a frequency distribution are not of equal width, the frequency density (frequency ÷ width) of each class is computed. The product of frequency density and the width of the class having the shortest interval is taken as the height of the corresponding rectangle.

2. Frequency Polygon: A frequency polygon is another method of representing a frequency distribution on a graph. Frequency polygons are more suitable than histograms whenever two or more frequency distributions are to be compared.

A frequency polygon is drawn by joining the mid-points of the upper sides of adjacent rectangles of the histogram of the data with straight lines. Two hypothetical class intervals, one in the beginning and the other in the end, are created. The ends of the polygon are extended up to the base line by joining them with the mid-points of the hypothetical classes. This step is necessary for making the area under the polygon approximately equal to the area under the histogram.

A frequency polygon can also be constructed without making rectangles. The points of the frequency polygon are obtained by plotting mid-points of classes against the heights of various rectangles, which will be equal to the frequencies if all the classes are of equal width.

Example 5

The daily profits (in rupees) of 100 shops are distributed as follows:

Profit / Shop   0 - 100   100 - 200   200 - 300   300 - 400   400 - 500   500 - 600
No. of Shops    12        18          27          20          17          6

Construct a frequency polygon of the above distribution.
Solution:

Figure 1.7: Frequency Polygon

3. Frequency Curve: When the vertices of a frequency polygon are joined by a smooth curve, the resulting figure is known as a frequency curve. As the number of observations increases, there is need of having more and more classes to accommodate them and hence the width of each class will become smaller and smaller. In such a situation the variable under consideration tends to become continuous and the frequency polygon of the data tends to acquire the shape of a frequency curve. Thus, a frequency curve may be regarded as a limiting form of frequency polygon as the number of observations becomes large.

The construction of a frequency curve should be done very carefully by avoiding, as far as possible, sharp and sudden turns. Smoothing should be done so that the area under the curve is approximately equal to the area under the histogram.

A frequency curve can be used for estimating the rate of increase or decrease of the frequency at a given point. It can also be used to determine the frequency of a value (or of values in an interval) of the variable. This method of determining frequencies is popularly known as the interpolation method.

4. Cumulative Frequency Curve or Ogive: The curve obtained by representing a cumulative frequency distribution on a graph is known as a cumulative frequency curve or ogive. Since a cumulative frequency distribution can be of 'less than' or 'more than' type, there are accordingly two types of ogive, 'less than ogive' and 'more than ogive'. An ogive is used to determine certain positional averages like median, quartiles, deciles, percentiles, etc. We can also determine the percentage of cases lying between certain limits. Various frequency distributions can be compared on the basis of their ogives.
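As an illustration of how an ogive yields a positional average, the following Python sketch builds the 'less than' and 'more than' cumulative frequencies for the wire-diameter distribution tabulated earlier in this unit and reads off the median. It is a sketch under stated assumptions: the function name `median_from_ogive` is introduced here, and the ogive is assumed to be drawn with straight line segments, so that the median is found by linear interpolation within the class where the cumulative frequency reaches half the total.

```python
from itertools import accumulate

# Exclusive classes 1.00-1.10, 1.10-1.20, ..., 1.90-2.00 of the wire-diameter data.
lower_limits = [1.00, 1.10, 1.20, 1.30, 1.40, 1.50, 1.60, 1.70, 1.80, 1.90]
width = 0.10
freqs = [4, 7, 10, 14, 20, 13, 9, 6, 4, 3]

total = sum(freqs)

# 'Less than' type: observations below each upper limit.
less_than = list(accumulate(freqs))

# 'More than' type: observations at or above each lower limit.
more_than = [total - below for below in [0] + less_than[:-1]]


def median_from_ogive(lower_limits, width, freqs):
    """Value at which the 'less than' ogive reaches half the total frequency."""
    half = sum(freqs) / 2
    cumulative_before = 0
    for lower, f in zip(lower_limits, freqs):
        if cumulative_before + f >= half:
            # Linear interpolation inside the median class.
            return lower + (half - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("frequencies must be positive")


print(less_than)
print(more_than)
print(round(median_from_ogive(lower_limits, width, freqs), 2))
```

The median obtained this way is also the value at which the 'less than' and 'more than' ogives cross, since both cumulative frequencies equal half the total there.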
Example 6

Draw 'less than' and 'more than' ogives for the following distribution of monthly salary of 250 families of a certain locality.

Income Intervals   0-500   500-1000   1000-1500   1500-2000   2000-2500   2500-3000   3000-3500   3500-4000
No. of Families    50      80         40          25          25          15          10          5

Solution: First we construct 'less than' and 'more than' type cumulative frequency distributions.

Income       Cumulative     Income       Cumulative
less than    Frequency      More than    Frequency
500           50            0            250
1000         130            500          200
1500         170            1000         120
2000         195            1500          80
2500         220            2000          55
3000         235            2500          30
3500         245            3000          15
4000         250            3500           5

Figure 1.8: Ogive

We note that the two ogives intersect at the median.

1.9 MEASURES OF CENTRAL TENDENCY

Summarisation of the data is a necessary function of any statistical analysis. As a first step in this direction, the huge mass of unwieldy data is summarised in the form of tables and frequency distributions. In order to bring the characteristics of the data into sharp focus, these tables and frequency distributions need to be summarised further. A measure of central tendency or an average is an essential and important summary measure in any statistical analysis. An average is a single value which can be taken as representative of the whole distribution.

Functions and Characteristics of an Average

Functions of an Average

1. To present huge mass of data in a summarised form: It is very difficult for the human mind to grasp a large body of numerical figures. A measure of average is used to summarise such data into a single figure which makes it easier to understand and remember.
2. To facilitate comparison: Different sets of data can be compared by comparing their averages. For example, the level of wages of workers in two factories can be compared by mean (or average) wages of workers in each of them.
3. To help in decision-making: Most of the decisions to be taken in research, planning, etc., are based on the average value of certain variables. For example, if the average monthly sales of a company are falling, the sales manager may have to take certain decisions to improve it.

Characteristics of a Good Average

A good measure of average must possess the following characteristics:

1. It should be rigidly defined, preferably by an algebraic formula, so that different persons obtain the same value for a given set of data.
2. It should be easy to compute.
3. It should be easy to understand.
4. It should be based on all the observations.
5. It should be capable of further algebraic treatment.
6. It should not be unduly affected by extreme observations.
7. It should not be much affected by the fluctuations of sampling.

1.10 VARIOUS MEASURES OF AVERAGE

Various measures of average can be classified into the following categories:

(a) Mathematical Averages:
(i) Arithmetic Mean or Mean
(ii) Geometric Mean
(iii) Harmonic Mean
(iv) Quadratic Mean

(b) Positional Averages:
(i) Median
(ii) Mode

1.11 ARITHMETIC MEAN

Before the discussion of arithmetic mean, we shall introduce certain notations. It will be assumed that there are n observations whose values are denoted by $X_1, X_2, \dots, X_n$ respectively. The sum of these observations $X_1 + X_2 + \dots + X_n$ will be denoted in abbreviated form as $\sum_{i=1}^{n} X_i$, where $\sum$ (called sigma) denotes the summation sign. The subscript of X, i.e., 'i', is a positive integer which indicates the serial number of the observation. Since there are n observations, variation in i will be from 1 to n. This is indicated by writing it below and above $\sum$, as written earlier. When there is no ambiguity in the range of summation, this indication can be skipped and we may simply write $X_1 + X_2 + \dots + X_n = \sum X_i$.

Arithmetic Mean is defined as the sum of observations divided by the number of
observations. It can be computed in two ways: (i) Simple arithmetic mean and (ii) weighted arithmetic mean. In case of simple arithmetic mean, equal importance is given to all the observations, while in weighted arithmetic mean, the importance given to various observations is not the same.

Calculation of Simple Arithmetic Mean

(a) When Individual Observations are given:

Let there be n observations $X_1, X_2, \dots, X_n$. Their arithmetic mean can be calculated either by direct method or by short-cut method. The arithmetic mean of these observations will be denoted by $\bar{X}$. When the entire population (N) is considered instead of a sample (n), the arithmetic mean is represented by the symbol

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Direct Method: Under this method, $\bar{X}$ is obtained by dividing the sum of observations by the number of observations, i.e.,

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Short-cut Method: This method is used when the magnitude of individual observations is large. The use of the short-cut method is helpful in the simplification of calculation work.

Let A be any assumed mean. We subtract A from every observation. The difference between an observation and A, i.e., $X_i - A$, is called the deviation of the ith observation from A and is denoted by $d_i$. Thus, we can write $d_1 = X_1 - A$, $d_2 = X_2 - A$, ....., $d_n = X_n - A$. On adding these deviations and dividing by n we get

$$\frac{\sum d_i}{n} = \frac{\sum (X_i - A)}{n} = \frac{\sum X_i}{n} - \frac{nA}{n} = \bar{X} - A$$

or $\bar{d} = \bar{X} - A$ (where $\bar{d} = \frac{\sum d_i}{n}$).

On rearranging, we get

$$\bar{X} = A + \bar{d} = A + \frac{\sum d_i}{n}$$

This result can be used for the calculation of $\bar{X}$.

Remarks: Theoretically we can select any value as assumed mean. However, for the purpose of simplification of calculation work, the selected value should be as near to the value of $\bar{X}$ as possible.

Example 7

The following figures relate to monthly output of cloth of a factory in a given year:

Months                    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Output (in '000 metres)   80    88    92    84    96    92    96    100   92    94    98    86

Calculate the average monthly output.
Solution:

(i) Using Direct Method

$$\bar{X} = \frac{80 + 88 + 92 + 84 + 96 + 92 + 96 + 100 + 92 + 94 + 98 + 86}{12} = 91.5 \text{ ('000 mtrs)}$$

(ii) Using Short-cut Method

Let A = 90.

X_i             80    88   92   84   96   92   96   100   92   94   98   86   Total
d_i = X_i - A   -10   -2   2    -6   6    2    6    10    2    4    8    -4   ∑d_i = 18

$$\bar{X} = 90 + \frac{18}{12} = 90 + 1.5 = 91.5 \text{ thousand mtrs}$$

(b) When data are in the form of an ungrouped frequency distribution

Let there be n values $X_1, X_2, \dots, X_n$, out of which $X_1$ has occurred $f_1$ times, $X_2$ has occurred $f_2$ times, ....., $X_n$ has occurred $f_n$ times. Let N be the total frequency, i.e., $N = \sum_{i=1}^{n} f_i$. Alternatively, this can be written as follows:

Values      X_1   X_2   ...   X_n   Total Frequency
Frequency   f_1   f_2   ...   f_n   N

Direct Method: The arithmetic mean of these observations using direct method is given by

$$\bar{X} = \frac{\overbrace{X_1 + \dots + X_1}^{f_1 \text{ times}} + \overbrace{X_2 + \dots + X_2}^{f_2 \text{ times}} + \dots + \overbrace{X_n + \dots + X_n}^{f_n \text{ times}}}{f_1 + f_2 + \dots + f_n}$$

Since $X_1 + X_1 + \dots + X_1$ added $f_1$ times can also be written as $f_1 X_1$, and similarly for the other observations, we have

$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i X_i}{N} \qquad \text{.... (1)}$$

Short-Cut Method: As before, we take the deviations of observations from an arbitrary value A. The deviation of the ith observation from A is $d_i = X_i - A$. Multiplying both sides by $f_i$, we have $f_i d_i = f_i (X_i - A)$. Taking the sum over all the observations,

$$\sum f_i d_i = \sum f_i (X_i - A) = \sum f_i X_i - A \sum f_i = \sum f_i X_i - A \cdot N$$

Dividing both sides by N, we have

$$\frac{\sum f_i d_i}{N} = \frac{\sum f_i X_i}{N} - A = \bar{X} - A \quad \text{or} \quad \bar{X} = A + \frac{\sum f_i d_i}{N} = A + \bar{d}$$

(c) When data are in the form of a grouped frequency distribution

In a grouped frequency distribution, there are classes along with their respective frequencies. Let $l_i$ be the lower limit and $u_i$ be the upper limit of the ith class. Further, let the number of classes be n, so that i = 1, 2, ....., n. Also let $f_i$ be the frequency of the ith class. This distribution can be written in tabular form, as shown.
Class Intervals   Frequency (f)
l_1 - u_1         f_1
l_2 - u_2         f_2
...               ...
l_n - u_n         f_n
Total Frequency   ∑f_i = N

Note: Here $u_1$ may or may not be equal to $l_2$, i.e., the upper limit of a class may or may not be equal to the lower limit of its following class.

It may be recalled here that, in a grouped frequency distribution, we only know the number of observations in a particular class interval and not their individual magnitudes. Therefore, to calculate the mean, we have to make a fundamental assumption that the observations in a class are uniformly distributed. Under this assumption, the mid-value of a class will be equal to the mean of observations in that class and hence can be taken as their representative. Therefore, if $X_i$ is the mid-value of the ith class with frequency $f_i$, the above assumption implies that there are $f_i$ observations each with magnitude $X_i$ (i = 1 to n). Thus, the arithmetic mean of a grouped frequency distribution can also be calculated by the use of the formulae given above for an ungrouped frequency distribution.

Remarks: The accuracy of the arithmetic mean calculated for a grouped frequency distribution depends upon the validity of the fundamental assumption. This assumption is rarely met in practice. Therefore, we can only get an approximate value of the arithmetic mean of a grouped frequency distribution.

Example 8

The following table gives the distribution of weekly wages of workers in a factory. Calculate the arithmetic mean of the distribution.

Weekly Wages     240-269   270-299   300-329   330-359   360-389   390-419   420-449
No. of Workers   7         19        27        15        12        12        8

Solution: It may be noted here that the given class intervals are inclusive. However, for the computation of mean, they need not be converted into exclusive class intervals.
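The computation for Example 8 can be sketched in Python as follows. This is an illustration rather than the text's own working; the variable names are introduced here, and the assumed mean A = 344.5 and class width h = 30 are choices made for convenience. The sketch takes the mid-value of each class as representative of its observations and obtains the mean by the direct, short-cut and step-deviation routes, all of which should agree.

```python
# Inclusive classes of weekly wages and the number of workers in each.
classes = [(240, 269), (270, 299), (300, 329), (330, 359),
           (360, 389), (390, 419), (420, 449)]
freqs = [7, 19, 27, 15, 12, 12, 8]

# Mid-value of each class: 254.5, 284.5, ..., 434.5.
mids = [(lower + upper) / 2 for lower, upper in classes]
N = sum(freqs)

# Direct method: sum of f * X divided by N.
mean_direct = sum(f * x for f, x in zip(freqs, mids)) / N

# Short-cut method: deviations d = X - A from an assumed mean A.
A = 344.5
sum_fd = sum(f * (x - A) for f, x in zip(freqs, mids))
mean_shortcut = A + sum_fd / N

# Step-deviation method: u = (X - A) / h, with h the common class width.
h = 30
sum_fu = sum(f * (x - A) / h for f, x in zip(freqs, mids))
mean_step = A + h * sum_fu / N

print(N, sum_fd, sum_fu)
print(round(mean_direct, 2), round(mean_shortcut, 2), round(mean_step, 2))
```

Choosing A as the mid-value of a central class keeps the deviations small, which is the purpose of the short-cut and step-deviation methods when the work is done by hand.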
Class Intervals   Mid-Values (X)   Frequency (f)   d = X - 344.5   fd
240-269           254.5             7              -90             -630
270-299           284.5            19              -60             -1140
300-329           314.5            27              -30             -810
330-359           344.5            15                0                0
360-389           374.5            12               30              360
390-419           404.5            12               60              720
420-449           434.5             8               90              720
Total                             100                              -780

$$\bar{X} = A + \frac{\sum fd}{N} = 344.5 - \frac{780}{100} = 336.7$$

Step Deviation Method or Coding Method

In a grouped frequency distribution, if all the classes are of equal width, say 'h', the successive mid-values of various classes will differ from each other by this width. This fact can be utilised for reducing the work of computations.

Let us define $u_i = \frac{X_i - A}{h}$. Multiplying both sides by $f_i$ and taking the sum over all the observations, we have

$$\sum_{i=1}^{n} f_i u_i = \frac{1}{h} \sum_{i=1}^{n} f_i (X_i - A)$$

or

$$h \sum_{i=1}^{n} f_i u_i = \sum_{i=1}^{n} f_i X_i - A \sum_{i=1}^{n} f_i = \sum_{i=1}^{n} f_i X_i - A \cdot N$$

Dividing both sides by N, we have

$$h \cdot \frac{\sum_{i=1}^{n} f_i u_i}{N} = \frac{\sum_{i=1}^{n} f_i X_i}{N} - A = \bar{X} - A$$

$$\therefore \quad \bar{X} = A + h \cdot \frac{\sum_{i=1}^{n} f_i u_i}{N} \qquad \text{.... (2)}$$

Using this relation we can simplify the computations of Example 8, as shown below.

u = (X - 344.5)/30   -3    -2    -1    0    1    2    3    Total
f                     7    19    27    15   12   12   8    100
fu                   -21   -38   -27   0    12   24   24   -26

Using formula (2), we have

$$\bar{X} = 344.5 - \frac{30 \times 26}{100} = 336.7$$

Charlier's Check of Accuracy

When the arithmetic mean of a frequency distribution is calculated by short-cut or step-deviation method, the accuracy of the calculations can be checked by using the following formulae, given by Charlier.

For short-cut method

$$\sum f_i (d_i + 1) = \sum f_i d_i + \sum f_i \quad \text{or} \quad \sum f_i d_i = \sum f_i (d_i + 1) - \sum f_i$$