Business Statistics MBA I SEM I PDF
Symbiosis International (Deemed University)
Dr. Anugamini Priya Srivastava
Summary
This document is an e-content on Business Statistics for MBA first semester. It provides an introduction to statistics, including definitions, characteristics, and various stages.
Established under Section 3 of the UGC Act, 1956 | Awarded Category - I by UGC

E-CONTENT
BUSINESS STATISTICS
MBA I SEM I
Dr. Anugamini Priya Srivastava

Gram: Lavale, Tal: Mulshi, Dist: Pune, Maharashtra, India Pin: 412115

Internal Advisory Board (Self-Learning Material)
Chancellor : Prof. Dr. S B Mujumdar (M.Sc. Ph.D.), Distinguished Academician & Educationist (Awarded Padma Bhushan and Padma Shri by President of India)
Pro-Chancellor, Symbiosis International (Deemed University) & Principal Director, Symbiosis : Dr. Vidya Yeravdekar
Vice-Chancellor : Dr. Rajani R. Gupte
Provost, Faculty of Health Sciences : Dr. Rajiv Yeravdekar
Dean-Academics & Administration : Dr. Bhama Venkataramani
Dean, Faculty of Law : Dr. Shashikala Gurpur
Dean, Faculty of Management : Dr. R. Raman
Dean, Faculty of Computer Studies : Dr. Dhanya Pramod
Dean, Faculty of Media and Communication : Dr. Ruchi Jaggi
Dean, Faculty of Humanities & Social Sciences : Dr. Jyoti Chandiramani
Dean, Faculty of Engineering : Dr. Ketan Kotecha
Dean, Faculty of Architecture & Design : Dr. Sanjeevani Ayachit
Director, Symbiosis School for Online & Digital Learning : Dr. Raju Ganesh Sunder
Programme Coordinator - Management Programmes : Dr. Pravin Narayan Mahamuni

Self-Learning Material: E-Content
T2217 Business Statistics (MBA I SEM I)
Authors : Dr. Anugamini Priya Srivastava
Editor & Reviewer : Dr. Akriti Chaubey
ISBN: 978-93-95877-05-3
Copyright © Registrar, Symbiosis International (Deemed University), Pune
All rights reserved. No part of this content may be reproduced or used in any form or by any means without prior permission from the publisher. This content is produced only for academic purpose and is for internal circulation only.
Acknowledgement : Every attempt has been made to trace the copyright holders of material reproduced in this content. Should any infringement have occurred, we apologize for the same and would be pleased to make necessary corrections in future editions of this content.
Published by : Symbiosis School for Online and Digital Learning, SIU, Lavale, Pune
Layout and Designing : Neha Creations, Pune 411030

CONTENT
MODULE - 1 : Introduction to Business Statistics
MODULE - 2 : Data Presentation
MODULE - 3 : Measures of Central Tendency
MODULE - 4 : Measures of Dispersion
MODULE - 5 : Measures of Dispersion
MODULE - 6 : Introduction to Probability
MODULE - 7 : Discrete Probability Distribution
MODULE - 8 : Continuous Probability Distributions
MODULE - 9 : Sampling and Different Sampling Techniques
MODULE - 10 : Hypothesis Testing
MODULE - 11 : Non-Parametric Tests
MODULE - 12 : Analysis of Variance (ANOVA)
MODULE - 13 : Simple Linear Regression
MODULE - 14 : SPSS and Data Analysis

MODULE - 1
INTRODUCTION TO BUSINESS STATISTICS

STRUCTURE
• Statistics and research
• Statistics definition
• Characteristics of statistics as statistical data
• Statistical methods
• Stages of statistical investigation
• Functions of statistics
• Scope of the statistics
• Limitations of statistics
• Summary

1.1 LEARNING OBJECTIVES
• To understand what business statistics is
• To understand the characteristics of business statistics
• To understand the uses and limitations of business statistics
• To understand the process of statistics

1.2 STATISTICS AND RESEARCH

Research is the search for what remains to be discovered. It is a process of exploring different topics and drawing conclusions. Over time, research has produced groundbreaking theories and principles that guide other disciplines and subject areas. However, research and exploration require empirical testing. On the one hand, theoretical models are derived from literature and exploration; on the other, statistical methods are used to test and verify those theoretical models.
Researchers have long acknowledged the importance of statistics in research, and the use of scientific methods has shown its relevance to scientific inquiry. The research process, which runs from planning to drawing meaningful conclusions, can be said to be dominated by statistical methods. The research steps comprise planning, designing, collecting and analysing data, followed by making valid inferences and reporting the final results. All of these steps require statistical understanding.

Planning and designing require a specific research design; here statistics helps to clarify the conceptual framework and enables effective execution of the next steps. A clear conceptual framework helps decide which method to select for data collection. Data collection involves selecting a sample and collecting data according to the research design, and statistical understanding helps explore the variety of methods available for both. From probability sampling to non-probability sampling, statistical knowledge helps researchers choose the most suitable sampling method. Similarly, it helps identify and explore different modes of data collection and choose the most suitable one from primary and secondary data sources.

Along the same line, statistics provides a wide range of methods to explore the collected data. It enables the recording and cleaning of the data with respect to missing values and outliers. Further, it helps in identifying the most effective methodology to understand the characteristics of the data for each variable, explore the average or dispersion in the data, assess the fitness of the model and gauge how well some variables predict others. Statistics also supports making valuable inferences based on statistical thresholds. Using different references and guidelines, researchers can draw useful inferences about their research ideas and hypotheses.
Lastly, statistics provides proper formats of data presentation, tabular or diagrammatic, that allow researchers to express their creativity in report writing. Thus, statistical analysis can prove to be the lifeline of research.

1.3 STATISTICS DEFINITION

Many scholars and researchers have defined statistics over time. On the one hand, statistics was defined as statistical data. This definition reflects the plural sense of the word statistics.

According to Webster, statistics refers to "the classified facts respecting the condition of people in a state, especially those facts which can be stated in numbers or tables of numbers or any tabular or classified arrangement". The definition provided by Webster is too narrow. It confines the scope of statistics to only those facts and figures that relate to the condition of the people in a state.

Yule and Kendall defined statistics as "quantitative data affected to a marked extent by a multiplicity of causes". In this definition, again, the authors restricted the meaning of statistics to quantitative data and the varied factors affecting it. However, statistics can also comprise some part of qualitative data analysis.

Considering the limitations of these definitions, Professor Horace Secrist provided the most comprehensive explanation of statistics as statistical data. He defined statistics as "aggregates of facts affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other". This definition sets out the different dimensions and characteristics of statistics as statistical data. Each segment of the definition is explained in section 1.4.

1.4 CHARACTERISTICS OF STATISTICS AS STATISTICAL DATA

1.4.1. Aggregates of facts: The first characteristic is that statistics does not work with single, isolated numbers or figures. It always involves aggregated facts and figures, because aggregates ensure effective comparability and allow relationships to be established. For example, data on the salary of one individual, a single purchase decision, or a single birth or death does not constitute statistics. However, aggregated data on these factors can be considered statistics.

1.4.2. Affected to a marked extent by a multiplicity of causes: Generally, facts and figures are influenced by a number of forces operating together over a considerable time. Assuming that a particular figure arises in isolation is wrong in statistics. For example, when we talk about rice production in a year, we need to consider data on the quantity of rainfall, the quality of soil and seeds, and the cultivation methods. So, in statistics, every piece of information collected is affected by multiple causes. Different ways and means have been identified for separating the effect of one force from the effects of the others. Statistics also recognises the difficulty of measuring the combined impact of a variety of factors, some of which are not measurable.

1.4.3. Numerically expressed: All facts and figures in statistics are numerically expressed. The figures need to be presented as numbers; only then can they be processed through statistical methods. For example, the statement "babies learn faster in the first year after birth" is not numerically expressed, while a statement like "babies learn 39% of their senses in the first year after birth" can be considered under statistics.

1.4.4. Enumerated or estimated according to reasonable standards of accuracy: In statistics, the data is either enumerated or estimated.
Where an actual count is impractical, as at an election rally, it is difficult to count the number of individuals present, so estimation is the better way to obtain the statistical data. In a classroom, by contrast, we can easily enumerate the students attending a session and say, for example, that 45 students are attending the lecture.

1.4.5. Collected systematically for a predetermined purpose: In statistics, to get accurate answers, the data must be collected in a systematic order, keeping the predetermined goal in mind. As already mentioned, research and statistics go hand in hand. So, to obtain reliable and accurate data for a nuanced analysis, the research objectives have to be kept in mind. Along the same line, data should not be collected in a haphazard manner; it should be collected scientifically, following every step of research.

For example, if the research problem is to identify the impact of leadership on employee behaviour, the researcher needs to clarify the research questions first, identify the research design based on propositions or hypotheses, and only then start the data collection. The data collection should in turn start by identifying the respondents who can answer accurately. So, in this example, the researcher needs to select leaders and their immediate employees as respondents. After selecting them, the data collection method should be identified and finalised. It can be a questionnaire, an interview, and so on. Starting from the research problem gives the researcher more clarity about the research design and the prospective respondents, which in turn helps in the systematic collection of data.

1.4.6. Placed in relation to each other: Statistics are collected and analysed to understand future trends and patterns.
Similarly, statistics help us understand how things have changed over time. It follows that statistical data must be comparable. The data has to be collected in a manner that enables chronological (time-wise) or geographical (region-wise) comparison. For example, statistics collected on the per capita income of employees working in developed nations can all be compared once they are converted to a common currency such as the dollar.

All these characteristics make statistics crucial in the field of research. Data that lacks these characteristics cannot be called statistics. As is well said, all statistics are numerical statements of facts, but all numerical statements of facts are not statistics.

1.5 STATISTICAL METHODS

Many researchers also define statistics as statistical methods. This is known as the singular sense of the word statistics. A few definitions of statistics as statistical methods are given below.

According to Professor A.L. Bowley, "statistics may be called the science of counting". On another occasion, Professor Bowley said, "statistics may be called the science of averages". Another definition states that "statistics is the science of measurement of the social organism, regarded as a whole in all its manifestations".

In the first definition by Professor Bowley, the emphasis was on collecting data, which includes counting. However, the other aspects of statistics, such as presentation, analysis and interpretation, are entirely ignored. In the second definition, the emphasis was on averages, as if statistics only enabled the calculation of averages. However, statistics involves a large number of statistical methods. It is not only about averages. Statistics also includes measures and techniques such as dispersion, skewness, correlation and regression, all of which are wholly ignored in the second definition.
In the third definition, statistics was again restricted, this time to the scope of sociology, covering man and his activities. In response to criticism of these definitions, Professor Bowley recommended that "statistics cannot be confined to any scope".

Boddington also defined statistics as statistical methods, calling it "the science of estimates and probabilities". Again, this definition focuses only on estimates and probabilities, which are just a small part of statistical methods.

The most comprehensive definition of statistical methods was given by Croxton and Cowden: "statistics may be defined as a science of collection, presentation, analysis and interpretation of data". This short yet comprehensive definition names the significant steps involved in statistical methods. It has already been emphasised that statistics must be based on data collected systematically. Thus, this definition identifies the four stages of statistical investigation discussed below.

1.6 STAGES OF STATISTICAL INVESTIGATION

The definition provides four stages of statistical investigation. In business statistics, however, one more stage is added: the organisation of data. So, let's understand the 5 stages of statistical investigation.

1.6.1 Collection of data: The first step of a statistical investigation is the collection of data. Based on the predetermined purpose, the data is to be collected in a systematic manner. Since it is the first step, proper care must be exercised, because it creates the foundation for all further analysis. If the data is not collected correctly, it will not be considered reliable. Data can be collected from two different sources, namely primary and secondary sources. The primary source involves first-hand data collection, where a statistician collects the data based on the research objectives.
This data is collected specifically for the purpose of the research project. The other way to gather information is through secondary sources. Secondary sources are published or unpublished sources from which the investigator collects data for use in the project. However, investigators need to be very cautious when using secondary data. The reliability and suitability of such data should be tested, and only then should it be used in the study.

1.6.2 Organization of data: The next step is to organise the data. This step includes editing and classification of data. Collected data can contain errors such as missing values and irrelevant answers. Similarly, in the case of open-ended questions, the responses can differ for each respondent. Organising the data helps investigators edit it carefully to deal with missing values, inconsistencies, irrelevant answers, vague answers and wrong calculations. Once the data is cleaned and revised, the information is classified based on common characteristics. Finally, based on the classification and tabulation, the data can be presented in a more refined form.

1.6.3 Presentation of data: Once the data is organised in tabular form, we can present it through diagrams. The most common ways of presenting data are bar graphs, pie charts and histograms. We will discuss all these methods of data presentation in the upcoming chapters. Data presented using the most appropriate diagrams provides more order and a better understanding of the data.

1.6.4 Analysis of the data: Depending upon the purpose of the study, the data is tabulated. As mentioned, the tabular presentation of data makes the characteristics of the data clearer. Once the first three steps are done (collection, organisation and presentation of data), the next step is the data analysis.
During the data analysis, the investigator needs to identify the most suitable analysis method to draw conclusions. For example, if the investigator wants to explain the effectiveness of leaders in an organisation, a questionnaire can be used for data collection. The responses can then be organised in a table, handling missing values and outliers, and descriptive analytics, including measures of location and measures of dispersion, can initially be used to serve the purpose. Further, if a statistician wants to establish a relationship between leadership and employee stress, the Pearson correlation test can be used.

1.6.5 Interpretation of data: Interpretation of the data is the last step of statistical investigation, but not the simplest one. This step calls for a high degree of skill and experience among researchers to arrive at practical conclusions. If the data is not adequately analysed, it cannot be interpreted properly and will yield defective and vague deductions. Correct interpretation leads to valid conclusions from the study and thus supports decision-making.

1.7 FUNCTIONS OF STATISTICS

So now, let's understand the functions of statistics.

1.7.1. Definiteness: Statistics presents general statements in a clear and definite form. The numbers and details must be precisely defined and stated for accurate results. For example, the statement "the GDP of the country has increased" is neither specific nor definite in any sense. Rewriting it as "the country's GDP growth has risen from 11% in 2020 to 15% in 2022" gives readers more clarity and a definite meaning.

1.7.2. Simplifies mass of figures: Along with being definite, statistics helps condense a mass of data into a few significant figures with proper and more precise meanings. In a raw data set, multiple columns and rows of data exist.
Using statistical methods, one can condense the data based on common characteristics and give it proper meaning for better interpretation. For example, by reading census reports on individual salary structures in a country, a researcher might not grasp the income of the entire population. However, a simplified value such as per capita income can be easily remembered and understood by everyone.

1.7.3. Facilitate comparison: Another function of statistics is to facilitate comparison. When collected systematically, statistics can ensure comparison. Unless figures can be compared with other figures of the same kind, they are devoid of any meaning in statistics. For example, suppose data on rainfall is collected for one year but the unit of measurement differs from country to country. In such a situation, comparison will not be possible until all the data is converted to the same unit. Similarly, when car sales data is reported in Rs crore for two different years, such as Rs 20,000 crore in 2020 rising to Rs 25,000 crore in 2022, the figures can easily be compared. So, statistics makes comparison between data more straightforward.

1.7.4. Formulating and testing hypothesis: Statistical methods give investigators different tools and measurements to formulate and test hypotheses. For example, if the investigator aims to test the effect of increased dopamine on students' performance in the classroom, then null and alternative hypotheses can be formulated using statistical understanding. If the alternative hypothesis is that dopamine affects students' performance, then a simple linear regression analysis can be used to test the hypothesis.

1.7.5. Helps in prediction: Statistics also enables effective prediction of the future. Using statistics, researchers can evaluate the overall trend and use the information to forecast future events.
For example, data on employee turnover from the last 5 years can be used to predict the number of employees that might leave the organisation in the next 6 months. Based on this prediction, good employee retention policies can be developed well in advance.

1.8 SCOPE OF THE STATISTICS

Statistics can be applied in different fundamental and multi-disciplinary areas. For example, statistics can be used in marketing to design and redesign advertisements for products, decide the placement of products on shelves in retail stores, bring realistic ideas into ads to attract more customers, and so on. The list goes on. The scope of statistics is so vast that it not only helps in proper data collection, summarisation and presentation, but also helps marketers develop new strategies and deal with problems.

1.8.1 Operations
In operations management and supply chain management, statistics supports quality management. On the one hand, it helps in manufacturing products of the ideal size and composition; on the other hand, it supports quality checks through random selection of items from lots. By drawing a random sample of articles from conveyor belts, managers can decide whether the quality of the product is acceptable or whether it should be rejected. Similarly, on delivery of raw materials, managers and executives can verify whether the materials delivered are of the prescribed quality.

1.8.2 Accounting
In accounting, statistics is used specifically in the auditing process. In auditing, auditors cross-verify vouchers or bills against the transactions recorded in ledgers and journals (the books of account). For cross-verification, the auditor selects vouchers and statements randomly from the registers and checks them against the corresponding transaction entries. Through this method of random sampling, there is a higher chance that any frauds and mistakes will be detected. We will discuss random samples in upcoming chapters.
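The random selection of vouchers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the register of 200 voucher numbers, the sample size of 20 and the seed value are hypothetical and do not come from the text.

```python
import random

# Hypothetical voucher register: the numbers 1..200 stand in for voucher IDs.
voucher_ids = list(range(1, 201))

# A fixed seed (an arbitrary choice) lets the same audit sample be drawn again.
rng = random.Random(42)

# Simple random sample without replacement: every voucher has the same
# chance of being picked, and no voucher is picked twice.
audit_sample = rng.sample(voucher_ids, k=20)

print(sorted(audit_sample))
```

The auditor would then check each sampled voucher against its ledger entry. Because the selection is random, nobody can know in advance which vouchers will be examined.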
1.8.3 Human resource management
In human resource management, information systems use statistics to shortlist candidates for interviews, select employees for training based on training needs, ensure effective performance appraisals, predict future employee turnover and absenteeism, and so on. The most advanced use of statistics here is in human resource analytics.

1.8.4 Banking
Similarly, in banking, statistics is used to collect and analyse information in order to understand future economic conditions and to evaluate other external factors affecting every line of business in which banks might be directly or indirectly interested.

1.8.5 Others
Statistics can likewise be used in investment, purchase, credit and control, personnel management, and research and development. Thus, we can conclude that the scope of statistics is vast and continuously increasing. Because of this, it is difficult to define statistics. At the same time, it is unwise to try to pin down the exact usage of statistics, as it has permeated almost every aspect of our lives.

1.9 LIMITATIONS OF STATISTICS

From the above discussion, it is clear that statistics is of great value in our lives. However, just as every concept has some limitations, so does statistics. It is not a magical device that provides a correct solution to every problem. It requires a proper data collection structure and critical analysis to attain valid conclusions. Without a systematic process, statistics can lead investigators to draw wrong conclusions. So, let's discuss the limitations of statistics.

1.9.1. Does not deal with isolated measurements: Statistics does not deal with individual or isolated measurements. It requires aggregated or average estimates for analysis and decision-making. For example, to make policies for employee retention, HR managers need information on compensation and rewards, the experience of employees, job satisfaction and other related factors.
But the observations on each element have to be taken as the average of the responses of all individuals. No policy can be developed based on individual reactions alone.

1.9.2. Deals with numbers: Statistics involves numeric facts and figures. It does not directly support the use of qualitative or text data. For example, data on males and females, usually considered nominal data, can be used for a diagrammatic presentation. However, it cannot be used for analysis as it stands. For analysis, these labels must be transformed into codes, such as male = 0 and female = 1.

1.9.3. Are true only on an average: The conclusions in statistics depend on the purpose of the research, the hypothesis developed, the research design adopted and the disciplinary background of the researcher. So, a conclusion reached in one analysis is not universally true. The generalisability of results must be verified before a conclusion is applied in another context.

1.9.4. Only a means: Statistical methods furnish only one way of studying a problem. They may not be the best under all circumstances. So, it should be carefully noted that statistics is only a means, not an end. It analyses the facts and throws light on the actual situation. Complete dependence on statistics may lead to fallacious conclusions in many cases. Proper consideration of the factors involved, along with cautious organisation and analysis, is required for accurate conclusions.

1.9.5. Can be misused: The most significant limitation of statistics is that it is liable to be misused. It is essential to understand that statistics can be moulded in any manner to establish right or wrong conclusions. If the statistical findings are based on incomplete information, one may arrive at false conclusions. That is why experience and skill are required to draw sensible conclusions from the data. Otherwise, there is every likelihood of a wrong interpretation.

1.10 SUMMARY

The chapter provided a basic understanding of the term business statistics.
The summary of this chapter is given below.
• Statistics and research: Research and statistics go hand in hand. Research can be conducted without statistics, but statistics without research will not help in attaining practical conclusions and decisions in the business world.
• Statistics definition: Statistics can be defined both as statistical methods and as statistical data. Different scholars have offered different definitions to explain the crux of statistics and its characteristics.
• Characteristics of statistics as statistical data: Statistics as statistical data comprises aggregates of facts affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other.
• Statistical methods: In the singular sense, statistics is also defined as statistical methods.
• Stages of statistical investigation: Statistics involves 5 stages: collection, organisation, presentation, analysis and interpretation.
• Scope of statistics: Statistics is helpful in every discipline, from operations management to production management to human resource management and marketing. It is also beneficial in the fundamental fields of accounting and banking.
• Limitations of statistics: Statistics has certain limitations which restrict its usage in research work. It uses only aggregates and numerical data for analysis. It holds true only on average and can lead to misleading results if not conducted with diligence and an understanding of the research purpose.

1.11 SELF ASSESSMENT QUESTIONS

Long questions
1. Explain the evolution of statistics.
2. Explain how statistics and research are interconnected.
3. Explain the limitations of statistics.
4. Discuss the functions of statistics.
5. Elaborate on the scope of statistics.
6. Explain the characteristics of statistics.

Short questions
1. How can statistics be helpful in the field of banking?
2. How can statistics be misused?
3. Define statistics as statistical methods.
4. Define statistics as statistical data.

Fill in the blanks
1. Yule and Kendall defined statistics as "quantitative data affected to a marked extent by a __________". (multiplicity of causes)
2. All the facts and figures are __________ expressed in statistics. (numerically)
3. Boddington defined statistical methods as "the science of __________". (estimates and probabilities)
4. In statistics, to get accurate answers, the data must be collected in a systematized order keeping the __________ in mind. (purpose)

True/False
1. Statistics specifically require a clear and definite form of general statements. (True)
2. Statistics help us understand how things have changed over time. (True)
3. To have reliable and accurate data collection for the nuanced analysis, research objectives need not be kept in mind. (False)
4. In statistics, unless the figures can be compared with another figure of the same kind, they are devoid of any meaning. (True)

Multiple choice questions
1. Statistical investigation is a 5-step process. Choose the correct sequence from the options below.
a. Collection, organisation, presentation, analysis and interpretation
b. Collection, presentation, organisation, analysis and interpretation
c. Collection, organisation, analysis, presentation and interpretation
d. Collection, presentation, analysis, organisation and interpretation
2. In accounting, statistics is used explicitly for
a. Quality assurance
b. Auditing process
c. Recruitment
d. None of the given options
3. Statistics are prone to be misused because
a. They can be moulded to establish right or wrong conclusions
b. It helps condense the collected data into a few significant figures with proper and more precise meanings
c. It is affected to a marked extent by a multiplicity of causes
d. All of the given options
4. Which of the following is not a function of statistics?
a. Facilitate comparison
b. Formulating and testing hypothesis
c. Helps in prediction
d. Includes only qualitative data

1.12 REFERENCES
• Black, K. (2019). Business statistics: for contemporary decision making. John Wiley & Sons.
• Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., & Cochran, J. J. (2020). Modern business statistics with Microsoft Excel. Cengage Learning.

MODULE - 2
DATA PRESENTATION

STRUCTURE
• Introduction
• Frequency distributions
• Steps to develop frequency distributions
• Class midpoint
• Relative frequency
• Cumulative frequency
• Quantitative data graphs
• Qualitative data presentations

LEARNING OBJECTIVES
• To understand the meaning of diagrammatic presentations
• To understand the frequency distributions
• To understand quantitative modes of data presentation
• To understand the qualitative modes of data presentation

2.1 INTRODUCTION

In the era of data analytics and data science, it is essential to understand how data can be summarised and presented to ensure effective communication of results. Thus, in this chapter, we will discuss different modes of diagrammatic presentation for qualitative and quantitative data with the help of examples. The first step toward analysing data is to explore it using tabular or diagrammatic presentations.

2.2 FREQUENCY DISTRIBUTIONS

A frequency distribution is a summary of data presented in the form of class intervals and frequencies. Frequency distributions are relatively easy to construct. However, to construct one properly, standard rules and guidelines have to be followed. Depending upon the nature of the data, frequency distribution tables of different shapes and designs can be developed. Even for identical data, various forms can be created, depending upon the taste of the individual researcher.
Table 2.1: Data on the number of employees who fell sick in each of the 12 months of the year 2021
34 12 23 34 20 23 56 16 67 11 20 10
STEPS TO DEVELOP FREQUENCY DISTRIBUTIONS
Determine the range of the raw data: The range is defined as the difference between the largest and smallest numbers. So, before moving ahead, researchers need to identify the range of the data. For example, for the data given in Table 2.1, the range is 67 - 10 = 57.
Determine how many classes it will contain: The next step is to identify how many classes are needed to present the data effectively. A standard rule is to select between 5 and 15 classes. If there are too few classes, the data summary may be too general to be of any use to researchers. If there are too many classes, the frequency distribution will not aggregate the data enough to be helpful. Ultimately, the number of classes is a judgement call made by the researcher. See Table 2.2, where the data given in Table 2.1 are divided into 6 classes, with the frequency of values within each class.
Table 2.2: Data on the number of employees who fell sick in each of the 12 months of the year 2021 (grouped)
Classes    Frequency
10-20      4
20-30      4
30-40      2
40-50      0
50-60      1
60-70      1
Each class includes its lower endpoint but not its upper endpoint, so a value of 20 is counted in the class 20-30.
Determine the width of the class interval: The next step is to identify the class width. The class width is approximately the range of the data divided by the number of classes. For the data given in Table 2.1, the approximation is $57 / 6 = 9.5$. Rounding up to the next whole number gives a class width of 10. The frequency distribution must start at a value equal to or lower than the lowest number of the ungrouped data and end at a value equal to or higher than the highest number. So, with a minimum value of 10 and a maximum of 67, the frequency distribution can start at 10 and end at 70.
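The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the variable names are our own, the number of classes (6) is simply the researcher's choice used in Table 2.2, and the sketch assumes that a value equal to a class's upper endpoint is counted in the next class, which is the convention that reproduces Table 2.2.

```python
import math

# Monthly counts of sick employees from Table 2.1
data = [34, 12, 23, 34, 20, 23, 56, 16, 67, 11, 20, 10]
num_classes = 6  # researcher's choice, as in Table 2.2

# Step 1: range = largest value - smallest value
data_range = max(data) - min(data)

# Step 3: class width = range / number of classes, rounded up
class_width = math.ceil(data_range / num_classes)

# The first class starts at the smallest observed value
start = min(data)

# Count the values in each class; a value equal to an upper endpoint
# falls into the next class (assumed convention)
frequencies = [0] * num_classes
for value in data:
    frequencies[(value - start) // class_width] += 1

for i, freq in enumerate(frequencies):
    lower = start + i * class_width
    print(f"{lower}-{lower + class_width}: {freq}")
```

Changing `num_classes` shows how the summary becomes too coarse or too fragmented when too few or too many classes are chosen.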
Table 2.2 shows the resulting distribution.
2.3 CLASS MIDPOINT
The midpoint of each class interval is known as the class midpoint. It is also called the class mark. Class midpoints are essential for data summarization and presentation. A class midpoint is the value halfway across the class interval and is calculated as the average of the two class endpoints. We can calculate midpoints using the formula
$\text{Class midpoint} = \frac{LL + UL}{2}$
Where,
LL represents the lower limit of the class interval
UL represents the upper limit of the class interval
Thus, using this formula, the midpoint of the first class (10-20) in Table 2.3 is (10 + 20) / 2 = 15.
2.4 RELATIVE FREQUENCY
Relative frequency is the form of frequency used in tables and diagrams that show relative proportions. It is defined as the proportion of the total frequency that falls in a given class interval of a frequency distribution. Table 2.3 lists the relative frequency for all classes. The relative frequency is calculated as
$\text{Relative frequency} = \frac{\text{class frequency}}{\text{total frequency}}$
For example, for the class 10-20, the relative frequency is 4/12 = 0.33 (rounded to two decimal places).
Table 2.3: Class midpoints, relative frequencies, and cumulative frequencies for absenteeism
Class interval   Frequency   Mid value   Relative frequency   Cumulative frequency
10-20            4           15          0.33                 4
20-30            4           25          0.33                 8
30-40            2           35          0.17                 10
40-50            0           45          0                    10
50-60            1           55          0.08                 11
60-70            1           65          0.08                 12
Total            12
2.5 CUMULATIVE FREQUENCY
The cumulative frequency is the running total of frequencies through the classes of a frequency distribution. In other words, the cumulative frequency for a class interval is calculated by adding the frequency of that class interval to the cumulative total of the preceding classes. As shown in Table 2.3, the cumulative frequency for class 10-20 is the same as its frequency, i.e. 4, while for class interval 20-30, the cumulative frequency is 4 + 4 = 8. The cumulative frequency for the other classes is calculated in the same way.
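The three derived columns of Table 2.3 can be reproduced with a short Python sketch. It is offered only as an illustration: the class limits and frequencies are copied from Table 2.2, and the variable names are our own.

```python
# Class limits (LL, UL) and frequencies taken from Table 2.2
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [4, 4, 2, 0, 1, 1]
total = sum(frequencies)

# Class midpoint = (LL + UL) / 2
midpoints = [(ll + ul) / 2 for ll, ul in classes]

# Relative frequency = class frequency / total frequency
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the class frequencies
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

for (ll, ul), f, m, r, c in zip(classes, frequencies, midpoints, relative, cumulative):
    print(f"{ll}-{ul}  f={f}  mid={m:.0f}  rel={r:.2f}  cum={c}")
```

The last entry of `cumulative` should always equal `total`, which gives a quick check on the arithmetic.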
This running total continues through the last interval, at which point the cumulative total equals the sum of the frequencies. In other words, to verify the cumulative frequencies, we can check that the final number in the cumulative frequency column equals the total of the frequency column. In Table 2.3, the final cumulative value is 12, equal to the total frequency.
2.6 QUANTITATIVE DATA GRAPHS
Quantitative data are data which are expressed numerically and are measured on an interval or ratio scale of measurement. To present such data effectively, it is suggested to transform them into graphs and plots. There are five major types of quantitative data graphs: 1) histograms, 2) frequency polygons, 3) ogives, 4) dot plots, 5) stem and leaf plots.
2.6.1 HISTOGRAM
A histogram is a series of contiguous bars or rectangles that denote the frequency of data in given class intervals. If the class intervals are equal, the frequency of the values in each class interval is represented by the height of the bar. If the class intervals are unequal, the class frequencies are represented by the areas of the bars (rectangles). As shown in Figure 2.1, a histogram has an x-axis labelled with class endpoints and a y-axis marked with the respective frequencies. Here, the x-axis is also known as the abscissa, while the y-axis is called the ordinate. A histogram is thus drawn by drawing a horizontal line at the frequency value from one class endpoint to the next and connecting each end of that line vertically down to the x-axis, forming a series of bars, as shown in Figure 2.1.
Figure 2.1: Histogram depicting the data on absenteeism
2.6.2 FREQUENCY POLYGONS
Frequency polygons are another way of presenting a quantitative data set. A frequency polygon is constructed by scaling class midpoints along the horizontal axis and the frequency scale along the vertical axis.
In other words, it is similar to a histogram. But instead of bars or rectangles, it is based on plotting a dot at each class midpoint and then connecting the dots with a series of line segments. As seen in Figure 2.2, the dots are plotted and then a line segment connecting each dot to the next is drawn.
Figure 2.2: Excel-produced frequency polygon for the days of absenteeism
2.6.3 OGIVES
An ogive is another way of presenting quantitative data. However, unlike histograms and frequency polygons, it is based on cumulative frequencies. An ogive is therefore a graphical presentation of cumulative frequency values. In the construction of an ogive, the following steps are taken:
• Label the x-axis with class endpoints and the y-axis with frequencies.
• Scale the y-axis so that it is large enough to include the total of the frequencies.
• Start by plotting a 0 at the beginning of the first class.
• Proceed by marking a dot at the end of each class interval at the height of its cumulative frequency.
• Lastly, connect all the dots to complete the presentation.
An ogive is most useful when the researcher wants to see running totals. For example, if researchers are interested in controlling overall production costs, an ogive can depict the cumulative costs over a financial year. As shown in Figure 2.3, the curve rises steeply through the 20-30 class interval, reflecting a large increase in the cumulative frequency (from 4 to 8).
Figure 2.3: Ogive using Excel
2.6.4 DOT PLOTS
Dot plots are simple charts generally used to display continuous, quantitative data. Each data value is plotted along the horizontal axis using a dot. If multiple data points have the same value, the dots stack up vertically. If there are many closely spaced points, it may not be possible to display all data values along the horizontal axis. Dot plots are helpful mainly for observing the overall shape of the distribution and for identifying values or intervals where the data cluster or where there are gaps.
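The stacking behaviour described above can be illustrated with a small text-only sketch in Python. It is an assumption-laden illustration, not the Excel procedure: it uses the raw values from Table 2.1, names of our own choosing, and prints the plot turned on its side (one row per value, one `*` per occurrence) so that no plotting library is needed.

```python
from collections import Counter

# Monthly counts of sick employees from Table 2.1
data = [34, 12, 23, 34, 20, 23, 56, 16, 67, 11, 20, 10]

# Count how many times each distinct value occurs
counts = Counter(data)

# One row per distinct value, one "*" per occurrence:
# repeated values "stack up" as longer rows
rows = [f"{value:>3} | " + "*" * counts[value] for value in sorted(counts)]

for row in rows:
    print(row)
```

Values that occur twice (20, 23 and 34) show two stacked dots, while the missing rows between 34 and 56 make the gap in the data visible.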
To simplify, dot plots are made using the following steps.
Table 2.4: Re-organize the data for dot plots
Summary format:
Value   Frequency
10      5
20      2
30      4
40      3
50      1
60      1
70      0
"Long" format:
Values   Frequency
10       0
10       1
10       2
10       3
10       4
20       0
20       1
40       0
40       1
40       2
40       3
50       0
50       1
50       2
60       0
70       0
• Step 1: Reorganize the data into a "long" format (as shown in Table 2.4).
• Step 2: Create a dot plot using the "scatterplot" option in Excel, as shown in Figure 2.4.
Figure 2.4: Excel steps for scatterplot.
• Step 3: Customize the chart as shown in Figure 2.5:
- Delete the gridlines.
- Delete the title.
- Increase the size of the individual dots.
- Change the x-axis to only span from 1 to 7.
Figure 2.5: Dot plot for data on absenteeism using Excel.
2.6.5 STEM AND LEAF PLOTS
Another way of organising and presenting quantitative data is the stem and leaf plot. This plot is developed by separating the digits of each number in the data into a stem (the left digits) and a leaf (the right digits). The process follows three rules:
• The left-most digits are the stem and consist of the higher-valued digits.
• The right-most digits are the leaves and consist of the lower-valued digits.
• If the values have only two digits, the stem is the digit on the left, and the leaf is the digit on the right.
Let's develop a stem and leaf plot using the example in Table 2.1. The stem and leaf plot is depicted in Figure 2.6.
Figure 2.6: Stem and leaf plot
1 | 0 1 2 6
2 | 0 0 3 3
3 | 4 4
5 | 6
6 | 7
The primary benefit of using a stem and leaf plot is that the investigator can readily see whether the scores are in the upper or lower end of each bracket and can also determine the spread of the scores. Another advantage is that the values of the original raw data are retained. In contrast, other frequency distributions and graphic presentations use the class midpoints to represent the values in a class interval.
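The stem and leaf construction can also be sketched in Python. This is a minimal illustration under stated assumptions: the function name `stem_and_leaf` is our own, and the sketch assumes non-negative whole numbers whose last digit is the leaf and whose remaining digits form the stem, as in Table 2.1.

```python
from collections import defaultdict

# Monthly counts of sick employees from Table 2.1
data = [34, 12, 23, 34, 20, 23, 56, 16, 67, 11, 20, 10]

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 34 -> stem 3, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Because every leaf is kept, the original values can be read back from the plot, which is the advantage over class midpoints noted above.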
2.7 QUALITATIVE DATA PRESENTATIONS
Now let's discuss graphs for presenting qualitative data. We are going to discuss two types of qualitative data graphs: 1. Pie charts, 2. Bar charts.
2.7.1 PIE CHARTS
A pie chart is a circular depiction of data. In this diagrammatic presentation, the area of the whole pie represents 100% of the data, and the slices of the pie represent a percentage breakdown of the sublevels. As shown in Figure 2.7, the data on the gender of respondents are depicted in a pie chart based on the values given in Table 2.5.
Table 2.5: Data on gender of respondents
FEMALE   13
MALE     7
Figure 2.7: Pie chart using Excel
Here, the total area of the pie is 100%, and the full angle is 360 degrees. So, to find the proportions, we can use the formula: female = 13/20 × 100 = 65% and male = 7/20 × 100 = 35%. Thus, based on the example shown in Figure 2.7, we can conclude that the respondents in the survey were mostly female (65%), while males made up only 35%. Converting this into degrees, the angle for females is 13/20 × 360 = 234 degrees. Similarly, for males it is 7/20 × 360 = 126 degrees. Thus, the pie chart shows the relative magnitude of each part to the whole. Pie charts are widely used to depict variables like market share, resource allocation, investment patterns, etc.
2.7.2 BAR GRAPHS
Another simple way of presenting qualitative data is a bar graph or bar chart. In a bar chart, two or more categories are shown along one axis, and a series of bars, one for each category, is drawn along the other axis. Usually, the length of the bar indicates the magnitude of the measure. As shown in Figure 2.8, the data given in Table 2.5 are shown as a bar chart. The bar graph is qualitative because the categories are non-numerical, and it may be either horizontal or vertical.
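The percentage and angle calculations used for the pie chart in section 2.7.1 can be checked with a short Python sketch. It is only an illustration: the counts come from Table 2.5, and the dictionary and variable names are our own.

```python
# Respondent counts from Table 2.5
counts = {"Female": 13, "Male": 7}
total = sum(counts.values())

# Share of the whole pie as a percentage (out of 100)
percentages = {label: n * 100 / total for label, n in counts.items()}

# Slice angle in degrees (out of 360)
angles = {label: n * 360 / total for label, n in counts.items()}

for label in counts:
    print(f"{label}: {percentages[label]:.0f}% -> {angles[label]:.0f} degrees")
```

The same two lines of arithmetic apply to any number of categories; the percentages always sum to 100 and the angles to 360.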
Figure 2.8: Bar chart using Excel
2.8 COUNTRY MAPS
Apart from the diagrams above for qualitative data, country- or state-level data can be depicted using country maps in Excel. The steps to develop a country map are stated below:
1. Summarize the data in Excel.
2. Go to the Insert tab and click on the Maps option under the charts tab, as shown in Figure 2.9.
Figure 2.9: Steps to choose country maps in Excel
3. Click on the map option, and the diagram will be generated automatically, as shown in Figure 2.10.
Figure 2.10: Country maps using Excel
However, there are a few points to keep in mind:
a. Map charts can plot only high-level geographic details. Latitude, longitude and street addresses cannot be mapped.
b. Map charts support one-dimensional display only.
c. An online connection is required to create new maps in Excel or to append data to existing maps.
d. Existing maps can be viewed offline.
2.10 SUMMARY
In the era of data analytics and data science, it is essential to understand how data can be summarized and presented to ensure effective communication of results.
Frequency distributions: A frequency distribution is a summary of data presented in class intervals and frequencies.
Class midpoint: A class midpoint is the value halfway across the class interval and is calculated as the average of the two class endpoints.
Relative frequency: The proportion of the total frequency that falls in a given class interval of a frequency distribution.
Cumulative frequency: The running total of frequencies through the classes of a frequency distribution.
Quantitative data graphs: Quantitative data are data which are expressed numerically and are measured on an interval or ratio scale of measurement. To present such data effectively, it is suggested to transform them into graphs and plots.
There are five major types of quantitative data graphs: 1) histograms, 2) frequency polygons, 3) ogives, 4) dot plots, 5) stem and leaf plots.
Qualitative data presentations can be done using two types of graphs: 1. Pie charts, 2. Bar charts.
2.11 SELF ASSESSMENT QUESTIONS
Long questions
1. According to T-100 Domestic Market, the top seven airlines in the United States by domestic boardings in a recent year were Southwest Airlines with 81.1 million, Delta Airlines with 79.4 million, American Airlines with 72.6 million, United Airlines with 56.3 million, Northwest Airlines with 43.3 million, US Airways with 37.8 million, and Continental Airlines with 31.5 million. Construct a pie chart and a bar graph to depict this information.
2. According to the National Retail Federation and Center for Retailing Education at the University of Florida, the four main sources of inventory shrinkage are employee theft, shoplifting, administrative error, and vendor fraud. The estimated annual dollar amount in shrinkage ($ millions) associated with each of these sources follows:
Employee theft         $17,918.6
Shoplifting             15,191.9
Administrative error     7,617.6
Vendor fraud             2,553.6
Total                  $43,281.7
Construct a pie chart and a bar chart to depict these data.
3. The following data represent the number of passengers per flight in a sample of 50 flights from Wichita, Kansas, to Kansas City, Missouri.
23 46 66 67 13 58 19 17 65 17
25 20 47 28 16 38 44 29 48 29
69 34 35 60 37 52 80 59 51 33
48 46 23 38 52 50 17 57 41 77
45 47 49 19 32 64 27 61 70 19
a. Construct a dot plot for these data.
b. Construct a stem-and-leaf plot for these data. What does the stem-and-leaf plot tell you about the number of passengers per flight?
4. Construct a histogram and a frequency polygon for the following data.
Class Interval   Frequency
30-under 32      5
32-under 34      7
34-under 36      15
36-under 38      21
38-under 40      34
40-under 42      24
42-under 44      17
44-under 46      8
Short questions
1. Define pie charts.
2. Define bar charts.
3.
Define the advantages of stem and leaf plots.
4. Explain the uses of country maps.
True and false
1. Pie charts are used for depicting absolute frequencies. (False)
2. Bar charts are similar to histograms. (False)
3. Ogives are developed based on cumulative frequency. (True)
4. Country maps can be used to depict street addresses. (False)
Fill in the blanks
1. The midpoint of each class interval is also known as the ……………………. (class mark)
2. An ogive is another way of presenting ………………….. data. (quantitative)
3. The ……………………… frequency refers to the running total of frequencies through the classes of a frequency distribution. (cumulative)
4. In a stem and leaf plot, the ……………….. digits are used as the leaf. (right)
2.12 REFERENCES
• Black, K. (2019). Business statistics: For contemporary decision making. John Wiley & Sons.
• Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., & Cochran, J. J. (2020). Modern business statistics with Microsoft Excel. Cengage Learning.
MODULE - 3 MEASURES OF CENTRAL TENDENCY
STRUCTURE
• Introduction to measures of central tendency
• Characteristics of a good average
• Σ Symbol
• Different types of measures of central tendency
• A.M. of Grouped frequency distribution
• Composite A.M.
• Advantages and disadvantages of A.M.
• Geometric Mean
• Advantages and disadvantages of G.M.
• Uses of G.M.
• Harmonic Mean
• Relationship among A.M., G.M. and H.M.
• Median
• Advantages and Limitations of Median
• Mode
• Relationship between Mean, Median and Mode
• Quartiles, Deciles and Percentiles
3.1 LEARNING OBJECTIVES
After going through this unit, you will be able to:
1. Understand what measures of central tendency are
2. Learn the different measures of central tendency
3. Explain the different types of means
4. Understand the relationship between mean, median and mode
5.
Explain quartiles, deciles and percentiles
3.2 INTRODUCTION
While working with data, we sometimes need a numerical expression that conveys certain characteristics of the whole data set, such as the most central value, the lowest value, the highest value or the value which appears most frequently. Measures that tell us the average value of a distribution are known as 'measures of central tendency' or, more popularly, 'averages'. A measure which represents the middle-most value should obviously be greater than the smallest value and less than the highest value. A measure of central tendency, or an average, of a distribution is a representative value of that distribution which enables us to comprehend in a single effort the significance of the whole. It should be a value which lies between the two limits, i.e., the highest and lowest points in the data set, possibly at the centre, where most of the values of the series cluster. Measures of central tendency, or averages, are also called measures of central location. These are arithmetical measures intended to represent the central value of a data set. We can say that an average of a distribution of the values of any variable (say, the weight in kg of some students in a class) is a representative value of that variable. In any set of observations, the representative value of a distribution usually lies at or near the centre of the distribution. This happens because the major part of the values in a distribution tends to be concentrated at the centre. Since the average reflects this tendency of the data, it is called a measure of central tendency. Depending upon the nature of a distribution, different methods of obtaining the representative value have evolved, and as a result we have several averages or measures of central tendency.
3.3 CHARACTERISTICS OF A GOOD AVERAGE
According to the statisticians Yule and Kendall, an average is considered good or efficient if it possesses the following characteristics:
• An average should be easily understandable.
• It should be rigidly defined. The definition should be such that its interpretation is not subjective in nature.
• The average should be easy to calculate.
• The average of a variable should be based on all values of the variable.
• The average should not change significantly in value if the sample changes. In simple terms, an average should possess sampling stability.
• The average should not be unduly affected by extreme values, i.e., the formula for the average should not show an unduly large change due to the presence of one or two very large or very small values in the distribution.
3.4 Σ SYMBOL
To denote a sum (i.e., a total of certain quantities), the Greek letter Σ (capital sigma) is used. For example, if a variable x takes the values $x_1, x_2, x_3, \ldots, x_n$, then the sum of these values, $(x_1 + x_2 + x_3 + \cdots + x_n)$, is denoted by $\sum_{i=1}^{n} x_i$ or $\sum x$. The symbol $\sum_{i=1}^{n} x_i$ means that the lower limit of i is 1 and the upper limit of i is n, i.e., i takes the values 1, 2, 3, …, n, and the symbol Σ means that all the values of $x_i$ for i = 1, 2, 3, …, n are to be added. Again, the symbol $\sum x$ implies 'sum of the values of x'.
Illustration: Express with the help of the Σ symbol:
a. $x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_{10}$
Solution: $x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_{10} = \sum_{i=4}^{10} x_i$
3.5 DIFFERENT TYPES OF MEASURES OF CENTRAL TENDENCY
The following three types of averages or measures of central tendency are used:
a) Mean
b) Median
c) Mode
a) Mean: The arithmetic mean is the average of a group of observations and is calculated by adding all the numbers and then dividing the sum so obtained by the number of observations.
Because, of the three types of mean, the arithmetic mean is the one most commonly used by statisticians, it is usually referred to simply as the 'mean'. Here, the mean of a population is represented by the Greek letter mu ($\mu$), and the mean of a sample is denoted by $\bar{x}$. There are three types of means, namely:
i) Arithmetic mean (A.M.)
ii) Geometric mean (G.M.)
iii) Harmonic mean (H.M.)
i) The A.M. of a variable x is denoted by the symbol $\bar{x}$ and is defined as the sum of the values of x divided by the number of values of x.
Formula for the A.M. of an individual series x: $x_1, x_2, x_3, \ldots, x_n$:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x}{n}$
Formula for an ungrouped frequency distribution:
x:  $x_1$  $x_2$  $x_3$  …  $x_n$
f:  $f_1$  $f_2$  $f_3$  …  $f_n$
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f x}{N}$, where $N = \sum f$
3.6 A.M. OF GROUPED FREQUENCY DISTRIBUTION
When calculating the A.M. of a grouped frequency distribution, we assume that the observations included in a class are concentrated around the centre of the class interval. To obtain the A.M. of a grouped frequency distribution, we treat the frequencies of the class intervals as the frequencies of the mid values of the corresponding classes. By this process, we convert a grouped frequency distribution into a discrete (or ungrouped) frequency distribution. Hence, by applying the A.M. formula for discrete or ungrouped frequency distributions, we can find the arithmetic mean of a grouped frequency distribution.
Illustrative examples (1):
i. Find the A.M. of the following numbers: 5, 8, 10, 15, 24 and 28.
ii. Find the A.M. of the following series: x: 4, -2, 7, 0 and -1.
Solution:
i. The required A.M. = $\frac{5 + 8 + 10 + 15 + 24 + 28}{6}$ = 90/6 = 15.
ii. The required A.M. = $\frac{4 + (-2) + 7 + 0 + (-1)}{5}$ = 8/5 = 1.6
Example 2: Find the A.M.
of the following frequency distribution:
x:  1  2   3   4   5   6   7   8  9
f:  7  11  16  17  26  31  11  1  1
Solution: First of all, we prepare the following frequency table:
x    f     fx
1    7     7
2    11    22
3    16    48
4    17    68
5    26    130
6    31    186
7    11    77
8    1     8
9    1     9
     N = 121   $\sum fx$ = 555
$\therefore$ A.M. = $\frac{\sum fx}{N}$ = 555/121 = 4.59 (approx.)
Note: All measures of central tendency or averages of a distribution have the same unit as the distribution.
3.7 COMPOSITE A.M.
The A.M. of two or more distributions taken together is called a combined or composite A.M. If there are $n_1$ values in a distribution ($d_1$) whose A.M. is $\bar{x}_1$, and the A.M. of the $n_2$ values of another distribution ($d_2$) is $\bar{x}_2$, then the A.M. of the combined distributions ($d_1$ and $d_2$) is given by:
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
Example: The average marks obtained by two groups of students in an examination are 75 and 85. If the average mark of all the students is 80, find the ratio of students in the two groups.
Solution: Let x denote the marks of all the students, $x_1$ the marks of the first group, $x_2$ the marks of the second group, $n_1$ the number of students in the first group and $n_2$ the number of students in the second group. Then $\bar{x}_1 = 75$, $\bar{x}_2 = 85$ and $\bar{x} = 80$. From the composite formula, $80 = \frac{75 n_1 + 85 n_2}{n_1 + n_2}$, so $80 n_1 + 80 n_2 = 75 n_1 + 85 n_2$, i.e. $5 n_1 = 5 n_2$. Hence $n_1 : n_2 = 1 : 1$.
3.8 ADVANTAGES AND DISADVANTAGES OF A.M.
Advantages:
• The A.M. is easy to determine and understand.
• The A.M. is based on all values of the distribution.
• It can be used for further algebraic treatment.
• The formula for the A.M. is rigidly defined, implying that for a given series, the value of the A.M. is unique.
• It provides a good basis for comparison.
• The values of a series need not be arranged in any order for calculating the A.M.
• If the A.M. and the number of observations in the series are known, then we can also find the sum of the distribution.
Disadvantages:
• The A.M. is unduly affected by extreme (i.e., very large or very small) values.
• The A.M. cannot be computed if even one of the values in the series is missing.
• The determination of the A.M.
in the case of a grouped frequency distribution can be misleading, as it is based on the unrealistic assumption that the observations of each class are concentrated around the centre of that class.
3.9 GEOMETRIC MEAN
If a variable x takes the values $x_1, x_2, x_3, \ldots, x_n$, then the nth root of the product of these n values is called the geometric mean of the variable x and is denoted by G. Thus,
$G = (x_1 x_2 x_3 \cdots x_n)^{1/n}$
Again, if the frequencies of $x_1, x_2, x_3, \ldots, x_n$ are $f_1, f_2, f_3, \ldots, f_n$ respectively, then the G.M. of x is defined as:
$G = (x_1^{f_1} x_2^{f_2} x_3^{f_3} \cdots x_n^{f_n})^{1/N}$, where $N = \sum f$
3.9.1 Advantages and disadvantages of G.M.:
Advantages:
• G.M. is rigidly defined.
• G.M. is based on all values of the distribution.
• G.M. is capable of further mathematical treatment.
• In comparison to A.M., G.M. is less affected by extreme values.
Disadvantages:
• G.M. is not that easy to determine and understand.
• G.M. of a distribution cannot be determined if there is even one negative value in the series. Also, if there is at least one zero value, then the G.M. will be zero.
3.9.2 Uses of G.M.
• G.M. finds extensive use in averaging ratios, rates and percentages.
• As population increases in geometric progression, G.M. is used in determining the average rate of increase.
• G.M. is considered to be the most suitable average in the construction of index numbers.
3.10 HARMONIC MEAN
The harmonic mean is the reciprocal of the A.M. of the reciprocals of the values in a distribution. If a variable takes the values $x_1, x_2, x_3, \ldots, x_n$, then the harmonic mean of x, which is denoted by H, is given by:
$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
Example: Determine the H.M. of the following numbers: 46, 121, 127, 202
Solution: Calculations for H.M.
x       1/x
46      0.0217
121     0.0083
127     0.0079
202     0.0050
Total   0.0429
The H.M. is given by: $H = \frac{4}{0.0429}$ = 93.24.
3.10.1 Advantages and disadvantages of H.M.:
Advantages:
• The H.M. of a distribution is based on all the observations.
• It is rigidly defined.
• It is capable of further mathematical treatment.
• It is suitable for series that have a wide dispersion.
Disadvantages:
• It is difficult to calculate and understand.
• If even a single value in the distribution is zero, then the H.M. cannot be computed.
• It gives more importance to smaller values.
3.10.2 Uses of H.M.:
The harmonic mean is used most frequently to find the average speed of an object that has traversed equal distances at different speeds. Similarly, to find the mean mileage of a car that covers equal distances with different mileages, the H.M. is used.
3.11 RELATIONSHIP AMONG A.M., G.M. AND H.M.
a. For any finite number of positive values, A.M. $\geq$ G.M. $\geq$ H.M.
b. For any two positive numbers, A.M. $\times$ H.M. = (G.M.)$^2$
Note:
• Although the A.M. is defined for all values, i.e., positive, negative and zero, the G.M. is defined for positive values only and the H.M. is defined for non-zero values.
• If $x_1 = x_2$, then A.M. = G.M. = H.M. If $x_1 \neq x_2$, then A.M. > G.M. > H.M.
• The above property holds true for any finite number of positive values.
3.12 MEDIAN
The median is the second type of average that is used. The median is the middle-most value in a distribution when the distribution is arranged in either ascending or descending order. It divides the distribution into two equal parts. Thus, there are equal numbers of observations on the right and left of the median value, i.e., the number of observations greater than the median equals the number less than it. To determine the median of an individual series, we first have to make sure that all the values in the distribution are arranged in a definite order, i.e., either ascending or descending. If the values are not in a definite order, then they have to be arranged in either ascending or descending order. For a distribution with an odd number of terms (values), the median is the middle-most value. If there is an even number of terms in the array, then the median is the average of the two middle numbers.
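The odd/even rule just described can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name `median` and the two sample series are our own choices, and Python's standard `statistics.median` function applies the same rule.

```python
def median(values):
    """Return the middle value of a series (average of the two middle values if n is even)."""
    ordered = sorted(values)  # the series must first be arranged in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of terms: the ((n + 1) / 2)th term is the median
        return ordered[mid]
    # Even number of terms: average of the (n / 2)th and (n / 2 + 1)th terms
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_series = [77, 73, 72, 70, 75, 79, 78]
even_series = [94, 33, 86, 68, 32, 80, 48, 70]

print(median(odd_series))   # middle (4th) of 7 ordered terms
print(median(even_series))  # average of the 4th and 5th ordered terms
```

Sorting in descending order instead of ascending order would give the same result, since the middle position is the same from either end.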
Symbolically, for an odd number of terms in the distribution,
Median = $\left(\frac{n+1}{2}\right)$th value from the beginning or the end
For an even number of terms in the distribution,
Median = average of the $\frac{n}{2}$th value and the $\left(\frac{n}{2} + 1\right)$th value
Example: Determine the median for the following series:
i. 77, 73, 72, 70, 75, 79, 78
ii. 94, 33, 86, 68, 32, 80, 48, 70
Solution:
i. Arranging the values of the series in ascending order, we get 70, 72, 73, 75, 77, 78, 79
No. of terms in the series = 7
The required median = the ((7+1)/2)th term = 4th term = 75.
ii. Arranging the series in ascending order, we get 32, 33, 48, 68, 70, 80, 86, 94
No. of terms in the series = 8 = even number
The required median = average of the $\frac{n}{2}$th value and the $\left(\frac{n}{2} + 1\right)$th value
$\frac{n}{2}$th (4th) value = 68, $\left(\frac{n}{2} + 1\right)$th (5th) value = 70
The required median = (68 + 70)/2 = 69.
Note: By arranging the terms in descending order, the same value of the median will be obtained.
3.12.1 Median of an ungrouped frequency distribution:
To determine the median of an ungrouped frequency distribution, we first have to arrange the distribution in a definite order and then form a cumulative frequency table. If the number of observations is odd, then the (N+1)/2th term (where N is the total frequency) will be the median. If (N+1)/2 is greater than one cumulative frequency x but less than or equal to the next cumulative frequency y, where x and y are two consecutive values in the cumulative frequency column, then the observation whose cumulative frequency is y is the median. However, if there is an even number of terms, then the A.M. (average) of the N/2th and the (N/2 + 1)th terms will be the median.
3.12.2 Median of a grouped frequency distribution:
While computing the median of a grouped frequency distribution, the cumulative frequencies of the various class intervals have to be found first. Then we have to find the median class. The median class is the class which contains the median value.
We locate the median class by applying the same principles as for the ungrouped frequency distribution (for an odd number of terms, the (N+1)/2th term is the median, and for an even number of terms, the average of the N/2th and (N/2 + 1)th terms is the median). After identifying the median class, the median value is determined by using the following formula:
$\text{Median} = L + \frac{\frac{N}{2} - f_c}{f} \times I$
Where,
L = Lower class limit (lower class boundary) of the median class
f = Frequency, i.e., simple frequency, of the median class
$f_c$ = Cumulative frequency of the class preceding the median class
N = Total frequency
I = Length of the median class (h may be used in place of I)
Note: This formula for obtaining the median of a grouped frequency distribution holds true only when the distribution is in ascending order. If the distribution is in descending order, it has to be rearranged in ascending order first in order to apply the above formula.
3.12.3 Advantages and Limitations of Median:
Advantages:
• The median is not affected by extreme values.
• It is easily determined and understood.
• The median can be determined graphically.
• Medians of individual distributions and ungrouped frequency distributions can be determined by mere observation in most cases.
Disadvantages:
• Unlike the other measures of central tendency, determination of the median requires the distribution to be arranged in a definite order if it is not already in order.
• The median is not based on all observations of the distribution.
• In comparison to the mean, it is more affected by fluctuations in sampling.
Use of median: For distributions having open-end class intervals, the median is the best measure of central tendency. In the case of income distributions, the median would yield better results.
Example: Determine the median for the following distribution:
Daily wages (Rs):   50-55  55-60  60-65  65-70  70-75  75-80  80-85
No. of workers:     6      10     22     30     16     12     15
Solution: Table for determining median
Daily wages (Rs)   No. of workers (f)   Cumulative frequency ($f_c$)
50-55              6                    6
55-60              10                   16
60-65              22                   38
65-70              30                   68
70-75              16                   84
75-80              12                   96
80-85              15                   111
                   N = 111
Since N = 111 is odd, the median will be the ((N+1)/2)th term, i.e. the (111+1)/2 = 56th term. From the cumulative frequency table, we find that the 56th term lies in the class 65-70, because the cumulative frequency rises from 38 to 68 within that class. Therefore 65-70 is the median class. Applying the formula Median = $L + \frac{N/2 - f_c}{f} \times I$ with L = 65, N/2 = 55.5, $f_c$ = 38, f = 30 and I = 5 gives Median = $65 + \frac{55.5 - 38}{30} \times 5$ = 67.92 (approx.).
3.13 MODE
The mode of a distribution is the value which occurs most frequently in the distribution, i.e., the value whose frequency is the maximum. It should be noted that the mode is not unique, which means that a distribution may have more than one mode. Distributions that have two modes are called bimodal distributions, and those that have more than two modes are termed multimodal. A distribution may also have no mode: if every observation in an individual series or a discrete frequency distribution occurs with the same frequency, the distribution has no mode. In the case of an ungrouped frequency distribution, the mode can be determined by observation in most cases. In the case of a grouped frequency distribution, the mode is obtained by using the following formula:
$\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times I$
Where,
L = lower class boundary of the modal class
$f_1$ = frequency of the modal class
$f_0$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class succeeding the modal class
I = length of the modal class (instead of I, the symbol h may also be used)
Note:
• The class (specified by a class interval) whose frequency is the maximum is called the modal class.
• The above formula is applicable when all the classes are of equal length.
Example: Determine the mode/modes of the following series, if any
i. 3, 4, 5, 2, 3, 4, 1, 6, 4;
ii. 7, 9, 11, 7, 6, 5, 9, 13;
iii. 3, 5, 6, 7, 9, 12, 3, 6, 5, 9, 12, 7
Solution:
i. The number 4 is repeated the maximum number of times (3 times). Hence the mode of the distribution is 4.
ii. Here we see that both the numbers 7 and 9 appear twice in the distribution.
Since the frequency of these numbers is the highest (2), 7 and 9 are the two modes of this series.
iii. In this series, the frequency of each observation is the same and hence this series has no mode.

3.13.1 Advantages and Limitations of Mode:
Advantages:
• The mode of an ungrouped frequency distribution can be determined by observation alone.
• Mode is not affected by extreme values.
• It can be determined graphically.
• It is also easy to understand.
Disadvantages:
• Mode is not based on all observations.
• Unlike the other averages, it is not capable of further mathematical treatment.

3.13.2 Use of Mode
The usefulness of mode is found in industries and business. A shoe maker can make use of mode by identifying the modal size of shoes and manufacturing that size in larger quantities. Weather forecasts are also based on mode. The mode is an appropriate measure of central tendency for nominal-level data.

3.14 RELATIONSHIP BETWEEN MEAN, MEDIAN AND MODE
An experimental or empirical relationship between mean, median and mode has been established by Prof. Karl Pearson. For a moderately skewed distribution the following relationship approximately holds:
Mean - Mode = 3 (Mean - Median)

3.15 QUARTILES, DECILES AND PERCENTILES
We have learnt so far that mean, median and mode are the measures of central tendency. In addition to these, there are a few other measures which are similar to the median and hence are studied along with the measures of central tendency. These three measures are Quartiles, Deciles and Percentiles. They indicate quantities at some specific places of a distribution.
Quartiles
Quartiles are those measures which divide the distribution into four equal parts when the distribution is arranged in ascending order. There are three quartiles in a distribution, namely Q1, Q2 and Q3.
Deciles
The nine quantities that divide a distribution into ten equal parts are called the deciles of the distribution.
The distribution needs to be arranged in ascending order. These are denoted by D1, D2, D3, …, D9.
Percentiles
Percentiles divide a distribution into hundred equal parts. There are 99 percentiles because it takes 99 dividers to separate a group of data into 100 parts. The nth percentile is the value such that at least n percent of the data are below that value and at most (100 - n) percent are above that value.

SELF ASSESSMENT Questions

Long Answer Questions
1. What do you mean by measures of central tendency? Discuss briefly the methods of measuring averages.
2. What are the different types of mean? Define each of them and state their relative merits and demerits.
3. Define median and mode. Explain how these measures are calculated in case of grouped and ungrouped data.
4. Find the missing frequency if the arithmetic mean is Rs 33 thousand.

Loss of sales (Rs in thousand)   0-10   10-20   20-30   30-40   40-50   50-60
No. of families                   10     15      30      -       25      20

5. Calculate A.M. and median of the distribution. Hence calculate mode using the empirical relation between the three.

Class intervals   59-61   61-63   63-65   65-67   67-69
Frequency           4      30      45      15       6

6. The following list shows the 15 largest banks in the world by assets according to Standard and Poor's. Compute the median and the mean assets from this group. Which of these two measures do you think is most appropriate for summarizing these data, and why? What is the value of Q2? Determine the 63rd percentile for the data. How could such information on percentiles potentially help banking decision-makers?

Bank                                       Assets (US$ millions)
Industrial & Commercial Bank of China      4,009
China Construction Bank Corp.              3,400
Agricultural Bank of China                 3,236
Bank of China                              2,992
Mitsubishi UFJ Financial Group             2,785
JP Morgan Chase & Co.                      2,534
HSBC Holdings                              2,522
BNP Paribas                                2,357
Bank of America                            2,281
Credit Agricole                            2,117
Wells Fargo & Co.                          1,952
Japan Post Bank                            1,874
Citigroup Inc.                             1,842
Sumitomo Mitsui Financial Group            1,175
Deutsche Bank                              1,166

Short Answer Questions
1. Mention the desirable characteristics of a good average.
2. Define quartiles, deciles and percentiles.
3. Explain composite mean.
4. State the empirical relation between mean, median and mode.
5. The average weight of the following distribution is 58.5 kg. Find the value of x.

Weight (kg)   50   55   60   x + 12.5   70   Total
No. of men     1    4    2      2        1    10

6. Suppose the average salaries of elementary school teachers in three cities A, B and C are Rs 13,300, Rs 14,500 and Rs 21,000. Given that there are 13,000, 17,200 and 2,400 elementary school teachers in these cities respectively, find the average salary of all the elementary school teachers in these cities.

Fill in the blanks
1. A.M. is very much affected by _______________ (Extreme values)
2. An average ___________ the given data. (Summarises)
3. For an open ended distribution, _________ cannot be determined. (Mean)
4. The arithmetic mean of -1, 0 and 1 is __________ (0)
5. A distribution which has two modes is called ____________ (Bimodal)
6. Median is a more suitable average for grouped data with ___________ classes. (Open end)

True and False
1. Mean is one of the measures of central tendency. (True)
2. Median can be computed without arranging the distribution in any definite order. (False)
3. A distribution can only have one mode. (False)
4. Geometric Mean (G.M.) is used to calculate the average speed of an object. (False)
5. For any finite distribution, A.M. ≥ G.M. ≥ H.M. (True)
6. Mean usually implies A.M. (True)

Multiple Choice Questions
1. Which of the following represents median?
a. First quartile
b. Fourth decile
c. Second quartile
d. None of the above
2. Which of the following relations among the location parameters does not hold?
a. Q2 = median
b. P50 = median
c. D5 = median
d. D6 = median
3. Extreme values have no effect on:
a. A.M.
b. Median
c. G.M.
d. H.M.
4. If the average of 7, 9, 12, x, 5, 4 and 11 is 9 then x is:
a. 13
b. 14
c. 15
d. 8
5. The mean of 8 numbers is 15. After a new number 24 is added, the new mean shall be:
a. 8
b. 16
c. 12
d. 10
6. The mode of the distribution of values 5, 9, 7, 7, 5, 9, 6, 7, 5, 4, 3, 4, 1, 5 is
a. 5
b. 9
c. 7
d. 3

Problem solving activities
A research agency administers a demographic survey to 90 telemarketing companies to determine the size of their operations. When asked to report how many employees now work in their telemarketing operation, the companies gave responses ranging from 1 to 100. The agency's analyst organizes the figures into a frequency distribution.

Number of employees working in telemarketing   Number of companies
0 - under 20                                    32
20 - under 40                                   16
40 - under 60                                   13
60 - under 80                                   10
80 - under 100                                  19

Compute the mean, median, and mode for this distribution.

Suggested reading:
• Chandan, J.S. Statistics for Business and Economics. New Delhi: Vikas Publishing House Pvt. Ltd., 1998.
• Gupta, S.C. Fundamentals of Statistics. New Delhi: Himalaya Publishing House, 2006.
• Kothari, C.R. Quantitative Technique. New Delhi: Vikas Publishing House Pvt. Ltd., 1984.
• Black, K. Business Statistics: Contemporary Decision Making. Wiley, 2009.

References
• Black, K. (2009, December 1). Business Statistics: Contemporary Decision Making (6th ed.). Wiley.
• Padmalochan, H., & Hazarika, P. (ca. 2007, April 4). A Textbook of Business Statistics (1st ed.) [Print]. S. Chand Limited.
• Peck, R., Olsen, C., & Devore, J. (2022, September 30). Introduction to Statistics and Data Analysis (AP(R) Edition) (4th ed.). Brooks/Cole, Cengage Learning.
• Quantitative Techniques (New Format). (2013, January 1). Vikas Publishing House.

MODULE - 4
MEASURES OF DISPERSION

STRUCTURE
• Measures of dispersion
• Absolute and Relative measures of dispersion
• Different measures of dispersion
• Range
• Interquartile range and Quartile Deviation (Q.D.)
• Co-efficient of Quartile Deviation (Q.D.)
• Mean Deviation (M.D.)
• Standard Deviation
• Empirical relationship between Q.D., M.D. and S.D.
• Coefficient of Variation

4.1 LEARNING OBJECTIVES
After going through this unit, you will be able to:
• Understand what measures of dispersion are
• Learn the different measures of dispersion
• Explain the difference between absolute and relative measures of dispersion

4.2 MEASURES OF DISPERSION
We know that the measures of central tendency provide a representative value of a single set of data. They determine the middle-most value of a distribution or that value which occurs most frequently in the distribution. However, sometimes these measures of central tendency are not fully representative of a given set of data. This happens in the case of distributions where the extent of variation in individual values in relation to the average, or in relation to the other values, is large. As an illustration, let us observe the following three series:

Series A   40   40   40   40   40
Series B   35   39   41   42   43
Series C   10   18   35   57   80

We observe that in the first series, all the values are 40. Thus, the A.M., which is 40, fully represents the series as well as the individual items. In the second series, the mean is again 40 and the values are not scattered much, as the minimum value in the series is 35 and the maximum is 43. Thus, for the second series too, we can say that the mean is a good representative of the series. Here, the discrepancy between the mean and the other values is not very high. In case of the third series, we observe that all the values are very different. Though the mean is the same as in the other two series, i.e., 40, the values are widely scattered, with 10 being the minimum value and 80 being the maximum. Clearly, in this case, the mean neither satisfactorily represents the series generally, nor the individual values of the series in particular.
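The comparison of the three series above can be reproduced with a short Python sketch. The helper names mean and spread are illustrative assumptions; spread simply reports the gap between the largest and the smallest value as a first rough indicator of scatter.

```python
series = {
    "A": [40, 40, 40, 40, 40],
    "B": [35, 39, 41, 42, 43],
    "C": [10, 18, 35, 57, 80],
}


def mean(values):
    """Arithmetic mean (A.M.) of a list of numbers."""
    return sum(values) / len(values)


def spread(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


# Same mean for every series, very different scatter
summary = {name: (mean(v), spread(v)) for name, v in series.items()}
for name, (m, s) in summary.items():
    print(f"Series {name}: mean = {m}, largest - smallest = {s}")
```

All three series report a mean of 40, while the gap between the extremes is 0, 8 and 70 respectively, which is the point the illustration makes in words.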
Thus, measures of central tendency lose their effectiveness and cannot be representative when the extent of variation (dispersion or scatteredness) of the individual values of a distribution in relation to their average or in relation to the other values becomes large. Hence it is important for a statistician to know not only the average of any type, but also the scatteredness of a distribution. Scatteredness of data about an average is termed dispersion or variation. Quoting Spiegel, "The degree to which numerical data tend to spread about an average value is called variation or dispersion."
The study of average alone without knowledge of dispersion may lead to an erroneous conclusion. A person with the knowledge of average and without the knowledge of variability was once travelling with his family and had to cross a river without a boat to reach the destination. He knew that the average depth of the river was 100cm and the average height of his family was 130cm. So he and his family decided to wade across the river on foot. But it so happened that the maximum depth of the river was 150cm and the height of his youngest son was 60cm. We can only fathom what must have happened to the family.
Depending upon the nature of a distribution, different methods of obtaining the representative value have evolved and, as a result, we have several averages or measures of central tendency.

4.2.1 Objectives of measuring variability
Following are the objectives of studying variability:
• To study the reliability of average: The study of variability helps in assessing the reliability of the average by determining the extent to which the data under study are homogeneous.
• To control variation: The study of variation is done to see if the variation among data is significant, and if it is so, then such study may help in suggesting measures to control the variation.
• To make comparison among series: Measures of variability are useful in comparing two or more series with regard to disparity or differences.
• To make further statistical analysis: Standard deviation, which is a measure of variability, is useful for the study of higher measures such as skewness, kurtosis, regression, correlation, etc.

4.2.2 Absolute and Relative measures of dispersion
The various types of measures of dispersion can be divided into
a) Absolute measures
b) Relative measures
a. Absolute measures: Absolute measures of dispersion are those which are expressed in the same units as the distributions for which these measures are obtained. For instance, if the original distribution is in kilograms, an absolute measure will also be in kilograms. For this reason an absolute measure of dispersion cannot be used to compare the variability between two or more distributions expressed in different units.
b. Relative measures: A relative measure of dispersion is one which is calculated as a percentage or coefficient of an absolute measure. A relative measure of dispersion is also known as the coefficient of dispersion. A relative measure of dispersion is free from any unit.

4.3 DIFFERENT MEASURES OF DISPERSION
There are two types of measures of dispersion, namely a) Absolute measures and b) Relative measures.
i. Absolute measures: There are four types of absolute measures of dispersion:
• Range
• Interquartile Range and Quartile Deviation (Q.D.)
• Mean Deviation (M.D.)
• Standard Deviation (S.D.) and variance.
ii. Relative measures: These are the following:
• Coefficient of Quartile deviation
• Coefficient of mean deviation
• Coefficient of standard deviation
• Coefficient of variation

4.5 RANGE
The range of a distribution is the difference between the largest and the smallest value of that distribution. Thus if L denotes the largest observation and S denotes the smallest observation, then Range R = L - S.

4.5.1 Advantages and Limitation of Range:
Range is easy to calculate and understand.
Range is used in quality assurance, where it is used to prepare control charts. However, the disadvantage of range is that it depends only on the two extreme values. This may, very often, lead us to a wrong conclusion. Also, range cannot be computed when the first class interval, the last class interval, or both are open.

Example: Determine range for the following distribution:

Weight (kgs)      40   47   56   62   70
No. of students    4    7   11    3    1

Solution: Here, L = 70, S = 40
Range = L - S = 70 - 40 = 30kg.

4.6 INTERQUARTILE RANGE AND QUARTILE DEVIATION (Q.D.)
Interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1) of the distribution. Quartile deviation, Q.D., is half of the interquartile range of a distribution. Thus,
Interquartile range = Q3 - Q1
Quartile deviation = (Q3 - Q1) / 2

4.6.1 Advantages:
• Q.D. is a better measure of dispersion than range as, unlike range, which takes into account only two of the values, Q.D. involves 50% of the values.
• Q.D. is not affected by extreme values as the lowest 25% and highest 25% of the observations are not considered while calculating the Q.D.
• It is the only measure which can be used for open ended class intervals.

4.6.2 Disadvantages:
• Since Q.D. is based on only 50% of the observations, it disregards the other half of the observations.
• It is not amenable to further mathematical treatment.
Interquartile range is specifically useful for those data users who are more interested in values towards the middle and less interested in the extremes.

4.6.3 Co-efficient of Quartile Deviation (Q.D.)
Co-efficient of Q.D. is a relative measure of dispersion and is defined as follows:
Co-efficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)

4.7 MEAN DEVIATION (M.D.)
Mean deviation is the third type of measure of dispersion and is the arithmetic mean (A.M.) of the absolute deviations of the observations in a distribution from the average (usually mean or median).
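The verbal definition of mean deviation can be turned into a short Python sketch. The function name mean_deviation and its parameter about are illustrative assumptions, and the data is Series C from section 4.2; mean and median come from Python's standard statistics module.

```python
from statistics import mean, median


def mean_deviation(values, about):
    """Arithmetic mean of the absolute deviations |x - A|,
    where A (the 'about' argument) is the chosen average."""
    return sum(abs(x - about) for x in values) / len(values)


series_c = [10, 18, 35, 57, 80]   # Series C from section 4.2

md_from_mean = mean_deviation(series_c, mean(series_c))       # A = mean = 40
md_from_median = mean_deviation(series_c, median(series_c))   # A = median = 35
print(md_from_mean, md_from_median)
```

For this series the absolute deviations from the mean add up to 114 and those from the median add up to 109, so the two mean deviations are 114/5 = 22.8 and 109/5 = 21.8.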
For applying mean deviation, variance and standard deviation, the data should be at least interval-level data.

M.D. = Σ|x - A| / n = Σ|d| / n

Here A = mean or median of x and d = x - A
Note:
i. In case of mean deviation from mean, 'A' may be AM, GM or HM but is usually taken as AM.
ii. |x - A| is called the absolute value of the deviation x - A. By absolute value we mean the magnitude of the value without considering the sign.
iii. Since the sum of the deviations measured from the mean is zero, in case of mean deviation we always take absolute deviations.

4.7.1 Advantages and disadvantages of Mean Deviation:
Advantages:
• The M.D. of a distribution is based on all the observations.
• It is less affected by extreme values.
• Since deviations are taken from an average (mean, median, mode), mean deviation is considered to be a good measure for comparing the variability among two or more distributions.
Disadvantages:
• In case of M.D., absolute values are taken and the actual signs of deviations are discarded.
• Mean deviation from mode is not considered to be a good measure of dispersion.
• For a grouped frequency distribution containing open-end class intervals, one cannot determine mean deviation.

4.7.2 Coefficient of Mean Deviation:
While mean deviation is an absolute measure of dispersion, coefficient of mean deviation is a relative measure of dispersion. The formula is given below:
Coefficient of mean deviation = (Mean deviation) / (The average from which mean deviation is taken)
Thus,
Coefficient of M.D. from mean = (M.D. from mean) / Mean
Coefficient of M.D. from median = (M.D. from median) / Median

Example: For the following distribution determine the mean deviation (M.D.) from mean and its coefficient.

4.8 STANDARD DEVIATION
Standard deviation is a popular measure of variability. It is used as an independent measure of analysis as well as a part of other analyses, such as computing confidence intervals and hypothesis testing. It is the positive square root of the arithmetic mean of the squares of the deviations of the values of a variable from its arithmetic mean.
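The definition just stated can be followed step by step in a short Python sketch. The function name standard_deviation is an illustrative assumption, and the data are Series B and Series C from section 4.2. The sketch divides by n, exactly as the definition says; the sample form that divides by n - 1 is a different quantity and is not shown here.

```python
import math
from statistics import pstdev


def standard_deviation(values):
    """Positive square root of the A.M. of squared deviations from the A.M."""
    n = len(values)
    am = sum(values) / n                                   # arithmetic mean
    mean_sq_dev = sum((x - am) ** 2 for x in values) / n   # A.M. of squared deviations
    return math.sqrt(mean_sq_dev)


series_b = [35, 39, 41, 42, 43]
series_c = [10, 18, 35, 57, 80]

sd_b = standard_deviation(series_b)
sd_c = standard_deviation(series_c)
print(round(sd_b, 2), round(sd_c, 2))   # 2.83 25.68
```

Both series have a mean of 40, yet the standard deviation of Series C is roughly nine times that of Series B. The standard library function statistics.pstdev computes the same divide-by-n quantity and can serve as a cross-check.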
If the variable x takes n values x1, x2, …, xn and if x̄ is the arithmetic mean of these values, then

S.D. (σ) = √[ Σ (xi - x̄)² / n ]

4.8.1 Advantages and limitations of S.D.
Advantages:
• Standard deviation is considered to be the best measure of dispersion.
• It is rigidly defined and is based on all observations.
• Standard deviation possesses the highest sampling stability as compared to the other measures of dispersion.
• The main disadvantage of mean deviation is that it disregards the algebraic signs of deviations; standard deviation faces no such limitation.
• The normal curve can also be analysed with the help of S.D.
• Standard deviation can be considered as the basis of sampling theory and correlation analysis.
Limit