Descriptive Statistics Book PDF (Week 1-6)

Document Details

IIT Madras BS

Prashant Sharma

Summary

This book covers descriptive statistics, including data types, frequency distributions, charts, and various descriptive measures. It's aimed at undergraduate students.

Full Transcript

Descriptive Statistics by Prashant Sharma

About the author

Prashant Sharma

Prashant Sharma holds a Master’s degree in Statistics, with an outstanding academic record, from the University of Delhi. This solid academic foundation in the subject serves as the bedrock of his writing, and he is driven by a keen interest in expanding his knowledge and exploring the vast realm of statistics.

As an instructor, Prashant has conducted live sessions for students, delivering statistical concepts in a clear and engaging manner. He actively seeks opportunities to enhance his teaching skills and to stay up to date with the latest statistical developments. By combining his expertise with online teaching platforms, Prashant strives to empower students of IITM BS to grasp statistical concepts and apply them to their respective fields. Through his writing and sessions, he continues to inspire students and to make statistics an approachable and captivating subject for all.

Contents

1 Statistics
  1.1 Population and Sample
  1.2 Major branches of statistics
  1.3 Purpose of statistical analysis
2 Data
  2.1 Unstructured and Structured Data
    2.1.1 Variables and Cases
  2.2 Classification of Data
    2.2.1 Categorical Data and Numerical Data
      2.2.1.1 Categorical Data
      2.2.1.2 Numerical Data
    2.2.2 Time-series and cross-sectional Data
    2.2.3 Scales of measurement
      2.2.3.1 Nominal scale of measurement
      2.2.3.2 Ordinal scale of measurement
      2.2.3.3 Interval scale of measurement
      2.2.3.4 Ratio scale of measurement
3 Describing categorical data: Frequency distribution
  3.1 Frequency Distribution
  3.2 Relative frequency
  3.3 Charts of categorical data
    3.3.1 Pie Chart
    3.3.2 Bar Chart
    3.3.3 Pareto Chart
  3.4 The Area Principle
    3.4.1 Misleading graphs: violating area principle
    3.4.2 Misleading graphs: truncated graphs
    3.4.3 Manipulated y-axis
    3.4.4 Indicating a y-axis break
    3.4.5 Round-off errors
  3.5 Summarizing Categorical Data
    3.5.1 Mode
      3.5.1.1 Bimodal and Multimodal data
    3.5.2 Median
4 Describing Numerical data
  4.1 Types of variables
    4.1.1 Discrete Variable
    4.1.2 Continuous Variable
  4.2 Organizing Numerical Data
    4.2.1 Organizing Discrete Data (single value)
    4.2.2 Organizing Continuous Data
      4.2.2.1 Terminology
  4.3 Stem-and-leaf diagram
    4.3.1 Steps to construct a stemplot
  4.4 Descriptive Measures
    4.4.1 Measures of Central Tendency
      4.4.1.1 Mean
      4.4.1.2 Median
      4.4.1.3 Mode
    4.4.2 Measures of Dispersion
      4.4.2.1 Range
      4.4.2.2 Variance
      4.4.2.3 Standard Deviation
  4.5 Percentiles
    4.5.1 Computing Percentiles
  4.6 Quartiles
  4.7 Five Number Summary
  4.8 Interquartile Range (IQR)
5 Association between two variables
  5.1 Association Between Two Categorical Variables
    5.1.1 Stacked Bar Chart
    5.1.2 100% Stacked Bar Chart
  5.2 Association Between Two Numerical Variables
    5.2.1 Scatter Plot
      5.2.1.1 Describing Association
    5.2.2 Measures of association between two numerical variables
      5.2.2.1 Covariance
      5.2.2.2 Correlation
      5.2.2.3 Fitting a line
  5.3 Association Between Categorical and Numerical Variables
    5.3.1 Point Bi-serial Correlation Coefficient
6 Basic Principle of Counting
  6.1 Introduction
    6.1.1 Addition rule of counting
    6.1.2 Multiplication rule of counting
      6.1.2.1 Solved Examples
    6.1.3 Unsolved Problems
7 Factorial
  7.1 Definition
      7.1.0.1 Simplifying expressions
      7.1.0.2 Unsolved Problems
8 Permutation
  8.1 Definition
      8.1.0.1 Solved Examples
  8.2 Permutation formula
    8.2.1 When repetition is not allowed
      8.2.1.1 Solved examples by using permutation formula
      8.2.1.2 Example: Application
  8.3 Permutation formula
    8.3.1 When repetition is allowed
      8.3.1.1 Solved examples
  8.4 Permutation formula
    8.4.1 Rearranging letters
  8.5 Circular Permutation
    8.5.1 Clockwise and anticlockwise are different
    8.5.2 Clockwise and anticlockwise are same
      8.5.2.1 Examples of calculating n and r
  8.6 Unsolved Problems
9 Combination
  9.1 Definition
      9.1.0.1 Solved Examples
    9.1.1 Drawing lines in a circle
      9.1.1.1 Some more solved examples on permutation and combination
  9.2 Unsolved Problems

Chapter 1

1 Statistics

Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions.

1.1 Population and Sample

Population: The total collection of all the elements that we are interested in is called a population.

Sample: A subgroup of the population that will be studied in detail is called a sample.

The relationship between a sample and a population is shown in the following picture:
[Figure: a sample as a subgroup of the population]

Example: Suppose a survey is conducted to find the prices of all houses in Tamil Nadu, and 1000 houses are randomly selected from the urban areas of Tamil Nadu for this study. It is concluded that the price of a house per square foot is roughly Rs. 5680. Then the sample consists of the 1000 selected houses from the urban areas of Tamil Nadu, and the population consists of all houses in Tamil Nadu.

1.2 Major branches of statistics

1. Descriptive Statistics
The part of statistics concerned with the description and summarization of data is called descriptive statistics.
Summarization of data means giving a numerical or graphical summary of the data, that is, describing its main points. A descriptive study may be performed on either sample data or population data.

2. Inferential Statistics
The part of statistics concerned with drawing conclusions from the data is called inferential statistics.

1.3 Purpose of statistical analysis

If the purpose of the analysis is to examine and explore information about the collected data only, then the study is descriptive.

For example: A class of 50 students took an exam (out of 100 marks), and the average marks of the class is calculated as 65. This type of study is descriptive because we are only summarizing the data (calculating the average marks of the whole class).

If the information is obtained from a sample of a population and the purpose of the study is to use that information to draw conclusions or inferences about the population, the study is inferential.

For example: A teacher wants to know the average marks of all students in the school. Since the school has a large number of students, the teacher collects a sample of students and calculates the average marks of the selected students, which is, say, 60 marks. The teacher then concludes (using statistical techniques) that the average marks of all students in the school is 60. This type of study is inferential because we are drawing a conclusion about the population based on the sample data.

Chapter 2

2 Data

Definition: Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.

Purpose of collecting data: Generally, we collect data when we want to understand the characteristics or attributes of some group or groups of people, places, things, or events.

For example:
(1) To know about temperatures in a particular month in Chennai, India.
(2) To know about the marks obtained by students in their Class X.
2.1 Unstructured and Structured Data

Unstructured Data
Unstructured data is a dataset that is not organized in a predefined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. Unstructured data requires more work to process and understand.

For example: YouTube comments, image files, social-media posts, lyrics of a song, etc.

When data are scattered with no structure, i.e., not in any standard format, the information is of very little use.

Structured Data
Structured data is data held in a standardized format that is clearly defined and searchable. For the information in a dataset to be useful, we must know the context of the numbers and text it holds. Structured data is easy to analyze and understand. Hence, we need to organize the data.

Let’s consider the following two examples:

(1) Dataset of students:

Name      Gender   Date of Birth    Marks in class 10th   Board
Anjali    F        17 Feb, 2003     484                   State Board
Pradeep   M        3 June, 2002     514                   ICSE
Divya     F        22 Mar, 2003     397                   State Board
Sarita    F        19 May, 2002     533                   ICSE
Harsha    M        4 March, 2002    436                   CBSE
Bhavana   F        7 Apr, 2003      526                   State Board
Rohit     M        4 March, 2002    378                   CBSE
Vikash    M        11 Oct, 2001     526                   CBSE

Table 1: Student dataset

The student dataset shown in Table 1 can be considered structured data because it is in tabular form and provides information about the Gender, Date of Birth, Marks in class 10th and Board of the students. This data is also easy to analyze and understand, as we can easily get information about any student, e.g., Anjali scored 484 marks in class 10th under the State Board, Pradeep is male and his date of birth is 3 June 2002, etc.
(2) Dataset of fertilizers:

Fertilizers   Types of Fertilizers   Area of fields (in acres)   Types of Crops   Amount of fertilizers (in kg)
Nitrogen      Inorganic              1                           Rice             200
Phosphorus    Inorganic              2                           Wheat            400
Manure        Organic                1.5                         Potato           300
Compost       Organic                1.3                         Rice             260
Potassium     Inorganic              1.6                         Pulse            320

Table 2: Fertilizers dataset

The fertilizers dataset shown in Table 2 can also be considered structured data because it is in tabular form and provides information about fertilizers. This data is also easy to analyze and understand, as we can easily get information from it, e.g., Potassium is an inorganic fertilizer and can be used for pulse in the amount of 320 kg, etc.

2.1.1 Variables and Cases

Case (observation): A case or observation is a unit for which data is collected. Cases should uniquely identify each row in the dataset.

Variable: A variable is a characteristic or attribute that varies across units. Intuitively, a variable is something that “varies”.

For example: In the student dataset of Table 1, each student (Anjali, Pradeep, Divya, etc.) is a case, since data is collected for every student and the names uniquely identify each row in the dataset. The variables are Name, Gender, Date of Birth, Board, etc., as their values vary from case to case.

Note: The student dataset is in tabular form. To organise data in tabular form, the following two points should be taken into consideration:
Rows represent cases: for each case, the same attributes are recorded.
Columns represent variables: for each variable, the same type of value is recorded for each case.

2.2 Classification of Data

Data is broadly classified into two categories: categorical data and numerical data.

2.2.1 Categorical Data and Numerical Data

2.2.1.1 Categorical Data

Categorical data are also called qualitative variables; they identify group membership. We cannot perform any meaningful mathematical operations on them.
In the student dataset illustrated in Table 1, Gender is a categorical variable because it has two categories, F and M, and any observation can be classified into one of them. Board is also a categorical variable, since it has three categories (State Board, ICSE and CBSE) and any observation can be categorized into one of these three groups.

2.2.1.2 Numerical Data

Numerical data are also called quantitative variables. They describe numerical properties of the data, i.e., we can perform mathematical operations on them.

In the student dataset of Table 1, Marks is a numerical variable because it describes a numerical property of the data: Rohit's marks are 378, Pradeep's marks are 514, Bhavana's marks are higher than Harsha's, etc.

Measurement units
The unit of measurement defines the meaning of numerical data, such as weights measured in kilograms, prices in rupees, heights in centimeters, etc. The data that make up a numerical variable in a data table must share a common unit.

2.2.2 Time-series and cross-sectional Data

If the data is recorded over a period of time, it is called time-series data. A graph of a time series showing values in chronological order is known as a time plot.

Example: Data collected to observe the temperature in Delhi on seven different days is time-series data, because it is recorded for only one place (Delhi) over a period of time (seven different days).

If the data is observed at the same time, it is called cross-sectional data.

Example: Data collected to observe the temperature of Delhi, Chennai, Jaipur and Bhopal on a particular day is cross-sectional data, because it is recorded at the same time for several places.

2.2.3 Scales of measurement

There are four scales of measurement: nominal, ordinal, interval and ratio. Every variable for which data is collected is measured on one of these scales.
2.2.3.1 Nominal scale of measurement

When the data for a variable consist of labels or names used to identify a characteristic of an observation, the scale of measurement is considered a nominal scale.

Example: Name, Board, Gender, Blood group, etc.

Note: Nominal variables are sometimes coded numerically; for example, we might code men as 1 and women as 2, or code men as 3 and women as 1. The codes carry no ordering.

In short: “A nominal scale is just categories or labels which do not contain any order.”

2.2.3.2 Ordinal scale of measurement

When data exhibit the properties of nominal data and the order or rank of the data is meaningful, the scale of measurement is considered an ordinal scale.

Example: Each customer who visits a restaurant provides a service rating of excellent, good, or poor. Here the data obtained are the labels excellent, good, or poor, i.e., the data have the properties of nominal data. In addition, the data can be ranked with respect to service quality.

Note: We can code an ordinal scale numerically: bad can be coded as 1, good as 2 and excellent as 3. There is an order in 1, 2, 3, but the distance between bad and good need not be the same as the distance between good and excellent. We know that excellent is better than good, but we cannot say that the difference between good and excellent is the same as the difference between good and bad. Thus, we have only an order.

In short: “An ordinal scale is just categories or labels which contain an order.”

2.2.3.3 Interval scale of measurement

If the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure, then the scale of measurement is an interval scale.

Note: Data with an interval scale of measurement are always numeric, and we can find the difference between any two values.
Ratios of values have no meaning here because the value of zero is arbitrary.

Example: Consider an air-conditioned room where the temperature is set at 20°C while the temperature outside the room is 40°C. It is correct to say that the difference in temperature is 20°C, but it is incorrect to say that it is twice as hot outdoors as indoors. For instance, the same two temperatures are 68°F and 104°F, whose ratio is about 1.53, not 2.

Temperature in degrees Fahrenheit or degrees Celsius has an interval scale of measurement because it has no absolute zero. On the Celsius scale, 0 and 100 are set as the freezing point and the boiling point of water, whereas on the Fahrenheit scale they are 32 and 212.

2.2.3.4 Ratio scale of measurement

If the data have all the properties of interval data and the ratio of two values is meaningful, then the scale of measurement is a ratio scale. A ratio scale has the absolute zero property, which is the key difference between ratio and interval scales.

Example: Height (in cm), Weight (in kg), Marks, etc. Data such as height, weight and marks can be added, subtracted, multiplied or divided because they all have the absolute zero property.

A summary of all the scales of measurement can be described as follows:
[Figure: summary of the scales of measurement]

Unsolved Problems

(1) An analyst wants to conduct a survey to test the maintenance of hospitals in a particular district in Bihar, for which he selects 25 hospitals randomly from that district. Identify the sample and population. [2 Marks]
(a) The population is all the hospitals in Bihar and the sample is all the hospitals in the district.
(b) The population is all the hospitals in Bihar and the sample is 25 selected hospitals in Bihar.
(c) The population is all hospitals in the district of Bihar and the sample is 25 selected hospitals in the district.
(d) None of the above
Answer: c

(2) In the 2011 Cricket ODI World Cup quarter-final match between India and Australia, a media organization estimated, based on information from matches previously played between the two teams, that Australia would beat India by 50 runs if Australia batted first. Which branch of statistics does this analysis belong to?
Answer: Inferential Statistics

(3) Values of temperature and humidity of a room are measured for 24 hours at a regular time interval of 30 minutes. Based on this information, choose the correct option:
(a) It is cross-sectional data.
(b) It is time-series data.
Answer: b

(4) What kind of data is “Social media posts”?
(a) Unstructured data
(b) Structured data
Answer: a

(5) What kind of variable is the qualification of a candidate sitting for a job interview?
(a) Numerical/Quantitative
(b) Categorical/Qualitative
(c) Numerical and discrete
(d) Numerical and continuous
Answer: b

(6) If addition and subtraction can be performed on a variable, then the scale(s) of measurement of the variable could be:
(a) Ordinal
(b) Ratio
(c) Interval
(d) Nominal
Answer: b, c

(7) Which of the following variable(s) have a nominal scale of measurement?
(a) Education qualification of a person
(b) Hair color
(c) Brand name of mobile phone
(d) Number plate of cars
Answer: b, c, d

Chapter 3

3 Describing categorical data: Frequency distribution

3.1 Frequency Distribution

A frequency distribution of qualitative data is a listing of the distinct values and their frequencies. Each row of a frequency table lists a category along with the number of cases in this category.

Example: Let’s construct a frequency table for the following data.
(1) A, A, B, C, A, D, A, B, D, C

Category   Frequency
A          4
B          2
C          2
D          2
Total      10

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category   Frequency
A          6
B          3
C          3
D          3
Total      15

(3) A, B, B, C, A, D, B, B, D, C, A, B, C, D, B

Category   Frequency
A          3
B          6
C          3
D          3
Total      15

(4) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category   Frequency
A          6
B          3
C          4
D          5
Total      18

3.2 Relative frequency

The ratio of the frequency to the total number of observations is called relative frequency.

Note: Relative frequency plays an important role in comparing two data sets. Because relative frequencies always fall between 0 and 1, they provide a standard for comparison.

Examples: Let us find the relative frequencies for the following data.

(1) A, A, B, C, A, D, A, B, D, C

Category   Frequency   Relative Frequency
A          4           0.4
B          2           0.2
C          2           0.2
D          2           0.2
Total      10          1

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category   Frequency   Relative Frequency
A          6           0.4
B          3           0.2
C          3           0.2
D          3           0.2
Total      15          1

3.3 Charts of categorical data

The two most common displays of a categorical variable are a bar chart and a pie chart.

3.3.1 Pie Chart

A pie chart is a circle divided into pieces proportional to the relative frequencies of the qualitative data. It is used to show the proportions of a categorical variable, and it is a good way to show that one category makes up more than half of the total.

Example: Consider the frequency table of the dataset A, A, B, C, A, D, A, B, D, C.

Category   Frequency   Relative Frequency
A          4           0.4
B          2           0.2
C          2           0.2
D          2           0.2
Total      10          1

Table 3.1

Figure 3.1 is the pie chart representation of the dataset in Table 3.1:
[Figure 3.1: Pie chart representation]

Since a pie chart gives the share of each category in the whole, the share of category A is 40%, category B is 20%, category C is 20% and category D is 20%.
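The frequency and relative-frequency tables of Sections 3.1 to 3.3.1 can also be produced programmatically. The following is a minimal Python sketch, not part of the original text; the helper name `frequency_table` is hypothetical, and only the standard-library `collections.Counter` is used.

```python
from collections import Counter


def frequency_table(observations):
    """Return (category, frequency, relative frequency) rows, sorted by category.

    Hypothetical helper for illustration only.
    """
    counts = Counter(observations)   # frequency of each distinct value
    total = len(observations)        # total number of observations
    return [(cat, n, n / total) for cat, n in sorted(counts.items())]


# Dataset (1) from the text: A, A, B, C, A, D, A, B, D, C
data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]
table = frequency_table(data)

for category, freq, rel in table:
    print(f"{category:<8} {freq:>3} {rel:>6.2f}")
print(f"{'Total':<8} {len(data):>3} {sum(rel for _, _, rel in table):>6.2f}")
```

Multiplying each relative frequency by 100 gives the percentage share shown by the corresponding slice of the pie chart.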
3.3.2 Bar Chart

A bar chart displays the distinct values of the qualitative data on a horizontal axis and the relative frequencies (or frequencies or percents) of those values on a vertical axis. The frequency or relative frequency of each distinct value is represented by a vertical bar whose height is equal to the frequency or relative frequency of that value. The bars should be positioned so that they do not touch each other.

A bar chart is most appropriate for representing the count of each category, and it can be oriented either horizontally or vertically.

Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category   Frequency   Relative frequency
A          6           0.33
B          3           0.17
C          4           0.22
D          5           0.28
Total      18          1

Table 3.2

Figure 3.2 represents the bar chart of the dataset in Table 3.2:
[Figure 3.2: Bar chart representation]

3.3.3 Pareto Chart

When the categories in a bar chart are sorted by frequency, the bar chart is sometimes called a Pareto chart. Pareto charts are popular in quality control to identify problems in a business process.

Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category   Frequency   Relative frequency
A          6           0.33
B          3           0.17
C          4           0.22
D          5           0.28
Total      18          1

Table 3.3

Figure 3.3 is the Pareto chart representation of the dataset in Table 3.3:
[Figure 3.3: Pareto chart representation]

Note: If the categorical variable is ordinal, then the bar chart must preserve the ordering.

For example: The T-shirt sizes L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M of twenty students are summarized in Table 3.4:

Size     Frequency   Relative frequency
Small    7           0.35
Medium   8           0.40
Large    5           0.25
Total    20          1

Table 3.4

The dataset of Table 3.4 is ordinal, so the order of the categories has been preserved. The bar chart representation of the dataset of Table 3.4 is given as follows:
[Figure 3.4: Bar chart of Ordinal data]

Purpose of using charts
(1) Pie charts are best to use when we are trying to compare parts of a whole.
(2) Bar graphs are used to compare things between different groups.

Many Categories
A bar chart or pie chart with too many categories might conceal the more important categories. In some cases, the smaller categories can be grouped together.

Consider the following bar chart with too many categories:
[Figure: bar chart with too many categories]

We can group the other categories together as follows:
[Figure: bar chart with the smaller categories grouped together]

Grouping the other categories together into one category conveys two important things:
(1) We are not excluding any data.
(2) A significant number comes from the smaller categories.

3.4 The Area Principle

The area principle says that the area occupied by a part of the graph should correspond to the amount of data it represents. A display of data must obey the area principle; violations of the area principle are a common way to mislead with statistics.

3.4.1 Misleading graphs: violating area principle

(1) Decorated graphs: Charts are sometimes decorated to attract attention, which often violates the area principle. For example, Figure 3.5 is a decorated graph:
[Figure 3.5: Decorated graph]

Figure 3.5 gives the total wine exports in the UK, Canada, Japan and Italy. However, there is no baseline, and the chart shows bottles on top of labeled boxes of various sizes and shapes.

Figure 3.6 represents the chart without decoration:
[Figure 3.6]

Each category is labeled, and the chart is accurate and has a baseline. The chart is consistent: the bars for all countries have equal width, and the area occupied by each bar is proportional to the data being presented.

(2) Violation of the area principle in a pie chart: Figure 3.7 represents the pie chart of the sales distribution of mobile phones of different companies.
[Figure 3.7]

The pie chart in Figure 3.7 violates the area principle because the areas occupied by the sales shares of HTC and Apple do not correspond to the amount of data they represent.
3.4.2 Misleading graphs: truncated graphs

Another common violation occurs when the baseline of a bar chart is not at zero.

(1) Consider the following two bar charts:
[Figure: two bar charts of the same data, the left with a baseline above zero and the right with the baseline at zero]

The left graph exaggerates the differences between the numbers because its baseline is not at zero, whereas the graph on the right shows the same data with the baseline at zero.

(2) The following figure represents the share of votes in an election in the USA.
[Figure: share of votes in an election in the USA]

From the lengths of the bars, the Republican party's voting percentage appears to be less than half of the Democratic party's, but the actual numbers show that this is not the case.

3.4.3 Manipulated y-axis

Expanding or compressing the scale on a graph so that changes in the data seem more or less significant than they actually are is known as manipulation of the y-axis.

For example: The following bar charts represent the number of sales of smartphones A and B at a local shop.
[Figure 3.8]
[Figure 3.9]

Figure 3.8 suggests that a significant number of both smartphones are being sold, whereas in Figure 3.9 the sales of smartphones A and B appear very low. So the graph in Figure 3.9 is misleading because it has a manipulated y-axis.

3.4.4 Indicating a y-axis break

We can indicate a y-axis break in a bar chart in the following way:
[Figure 3.10]

3.4.5 Round-off errors

It is important to check for round-off errors. When table entries are percentages or proportions that have been rounded, the total may differ slightly from 100% or 1. The same can happen in a pie chart.

For example, consider the following table:

Category   Percentage
A          22.3
B          35.6
C          12.6
D          11
E          18.5
Total      100

In the table, the total is 100%. Suppose we round off the values and draw a pie chart as follows:
[Figure: pie chart of the rounded values]

This pie chart has a round-off error because the total of all entries is 100.5%, which differs from 100%.

3.5 Summarizing Categorical Data

Bar charts and pie charts are graphical summaries of categorical data.
Numbers that are used to describe data sets are called descriptive measures. Descriptive measures that indicate where the center or most typical value of a data set lies are called measures of central tendency.

3.5.1 Mode

The mode of a categorical variable is the most common category, i.e., the category with the highest frequency. The mode labels the longest bar in a bar chart, the widest slice in a pie chart and the first category shown in a Pareto chart.

Example: Consider the dataset A, A, B, C, A, D, A, B, C, C, A, B, C, D, A. Here, category A is the mode of the data as it occurs with the highest frequency. Figures 3.11, 3.12 and 3.13 represent the bar chart, pie chart and Pareto chart for the dataset:

(1) Bar chart representation of the dataset:
[Figure 3.11]
In Figure 3.11, category A has the longest bar. Thus, the mode of the dataset is category “A”.

(2) Pie chart representation of the dataset:
[Figure 3.12]
In the pie chart, category A has the widest slice. Thus, the mode of the dataset is category “A”.

(3) Pareto chart for the dataset:
[Figure 3.13]
In the Pareto chart, the first bar is for category A. Thus, the mode of the dataset is category “A”.

3.5.1.1 Bimodal and Multimodal data

If two or more categories tie for the highest frequency, the data is called bimodal (in the case of two) or multimodal (more than two).

Example: Consider the dataset A, A, B, C, A, C, A, B, C, C, A, C, C, D, A, A, C, D, B. Here both categories “A” and “C” have the highest frequency (7 each). Thus, this data is bimodal. The bar chart of the data shows the same:
[Figure: bar chart of the bimodal dataset]
In the bar chart, both categories “A” and “C” have the highest frequency.

3.5.2 Median

The median of an ordinal variable is the category of the middle observation of the sorted values. If there is an even number of observations, we can choose the category on either side of the middle of the sorted list as the median.
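The definitions of the mode (Section 3.5.1) and of the median of an ordinal variable (Section 3.5.2) can be sketched in Python as follows. This is an illustrative sketch rather than the author's method: the function names `modes` and `ordinal_median` are hypothetical, and the ranking of the categories must be supplied by the user through the assumed `order` argument.

```python
from collections import Counter


def modes(observations):
    """Return every category that ties for the highest frequency."""
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(cat for cat, n in counts.items() if n == highest)


def ordinal_median(observations, order):
    """Return the median category (or both candidates when they differ).

    `order` lists the categories from lowest to highest rank (an assumption
    the caller has to provide; it cannot be inferred from the labels).
    """
    ranked = sorted(observations, key=order.index)
    n = len(ranked)
    if n % 2 == 1:
        return [ranked[n // 2]]
    # Even n: the category on either side of the middle may be chosen.
    return sorted({ranked[n // 2 - 1], ranked[n // 2]}, key=order.index)


# Datasets from Sections 3.5.1 and 3.5.1.1
unimodal = "A A B C A D A B C C A B C D A".split()
bimodal = "A A B C A C A B C C A C C D A A C D B".split()
print(modes(unimodal))  # ['A']
print(modes(bimodal))   # ['A', 'C']

# T-shirt sizes from Table 3.4, ranked Small < Medium < Large
sizes = "L M M S L S S M L M M S S L M S M S L M".split()
print(ordinal_median(sizes, order=["S", "M", "L"]))  # ['M']
```

Returning a list for the mode makes the bimodal and multimodal cases explicit instead of silently picking one of the tied categories.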
Examples:

(1) When the number of observations is odd: Consider the grades of 15 students: A, B, B, C, A, D, B, B, A, C, B, B, C, D, A. To find the median of the categorical data, we need to order the data. The ordered data is A, A, A, A, B, B, B, B, B, B, C, C, C, D, D. Hence, the median grade is the category associated with the 8th observation, which is “B”.

(2) When the number of observations is even: Consider the grades of 14 students: A, B, B, C, A, D, B, B, A, C, B, B, C, D. The ordered data is A, A, A, B, B, B, B, B, B, C, C, C, D, D. The median grade is the category associated with the 7th or 8th observation, which is “B”.

In example (1), the mode of the dataset is also category “B”. Here, the mode and the median are the same.

(3) Consider the grades of 15 students: A, B, B, C, A, D, A, B, A, C, B, A, C, D, A. The ordered data is A, A, A, A, A, A, B, B, B, B, C, C, C, D, D. The median grade is the category associated with the 8th observation, which is “B”. The most common grade is “A”, hence the mode is “A”. In this example, the mode and the median are different.

Note: The median can be defined only for ordinal data, whereas the mode can be defined for both nominal and ordinal data.

Unsolved Problems

(1) If an analyst wants to represent the revenues of various companies using graphs, then which of the following graphical representations is/are most appropriate for the purpose? (More than one option can be correct)
(a) A pie chart with a pie/slice for each company and the width corresponding to its revenue in crore rupees.
(b) A bar chart with a bar for each company on the x-axis and the length corresponding to its revenue in crore rupees on the y-axis.
(c) A bar chart with a bar for each company on the y-axis and the length corresponding to its revenue in crore rupees on the x-axis.
(d) A bar chart with the minimum revenue as a baseline.
Answer: b, c

(2) Mode of a categorical variable is: (More than one option can be correct)
(a) The last bar in ascending order of a Pareto chart.
(b) The middle-most bar in a Pareto chart.
(c) The longest bar in a bar chart.
(d) The widest slice in a pie chart.
Answer: a, c, d

(3) Which of the following can be defined for both nominal and ordinal data?
(a) Mean
(b) Median
(c) Mode
(d) All of the above
Answer: c

A total of 2000 cases of Covid-19 were registered on 5th May 2020 in 5 key districts of Maharashtra. The proportion (out of 5 districts) of cases in each district is listed in Table 2.1.A. Based on the information given, answer questions (4) and (5).

| District | Relative Frequency |
|---|---|
| Mumbai | 0.35 |
| Pune | 0.20 |
| Nagpur | x |
| Thane | 0.25 |
| Nashik | 0.08 |

(4) Find the relative frequency of district Nagpur.
Answer: 0.12

(5) How many cases were registered in Pune on 5th May?
Answer: 400

Chapter 4

4 Describing Numerical data

4.1 Types of variables

4.1.1 Discrete Variable

A discrete variable usually involves a count of something. For example: number of people in a household, number of spelling mistakes in a report, number of accidents in a month in a particular city, etc.

4.1.2 Continuous Variable

A continuous variable usually involves a measurement of something. For example: weight of a person, height of a person, speed of a vehicle, etc.

4.2 Organizing Numerical Data

We can use the following procedure for organizing numerical data.

(1) Group the observations into classes (also known as categories or bins) and then treat the classes as the distinct values of quantitative data.

(2) Once we group the quantitative data into classes, we can construct frequency and relative-frequency distributions of the data.

4.2.1 Organizing Discrete Data (single value)

We can proceed in the following ways for organizing the discrete data.
(1) If the data set contains only a relatively small number of distinct, or different, values, it is convenient to represent it in a frequency table.

(2) Each class represents a distinct value (single value) along with its frequency of occurrence.

For example: Suppose the dataset reports the number of people in a household, and the responses from 15 individuals are 2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4. The distinct values that the variable, number of people in each household, takes are 1, 2, 3, 4, 5. The frequency distribution table is:

| Value | Frequency | Relative frequency |
|---|---|---|
| 1 | 2 | 0.13 |
| 2 | 3 | 0.20 |
| 3 | 5 | 0.33 |
| 4 | 4 | 0.27 |
| 5 | 1 | 0.07 |
| Total | 15 | 1 |

Here each value is considered as a category.

[Figure: bar chart of the household-size frequencies]

Since the values are distinct, the bars in the graph are not joined to each other; the graph simply marks the height (frequency) of each bar.

4.2.2 Organizing Continuous Data

Organize the data into a number of classes to make the data understandable. However, there are a few guidelines that need to be followed.

(1) Number of classes: The appropriate number is a subjective choice; the rule of thumb is to have between 5 and 20 classes.

(2) Each observation should belong to some class and no observation should belong to more than one class.

(3) It is common, although not essential, to choose class intervals of equal length.

4.2.2.1 Terminology

(1) Lower class limit: The smallest value that could go in a class.

(2) Upper class limit: The largest value that could go in a class.

(3) Class width: The difference between the lower limit of a class and the lower limit of the next-higher class.

(4) Class mark: The average of the two class limits of a class.

(5) A class interval contains its left-end but not its right-end boundary point.
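The frequency table for the household data and the left-closed class convention in item (5) can be sketched as follows. This is a minimal illustration, not the book's procedure: the helper names `frequency_table` and `class_of`, and the choice of a starting limit of 30 with width 10, are assumptions made for the example.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], round(counts[v] / n, 2)) for v in sorted(counts)]

def class_of(x, start, width):
    """Return the class [lower, upper) that contains x.

    Classes are left-closed and right-open, so an observation equal to a
    boundary goes into the class that starts at that boundary.
    """
    lower = start + ((x - start) // width) * width
    return (lower, lower + width)

household = [2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4]
for value, freq, rel in frequency_table(household):
    print(value, freq, rel)

print(class_of(39, start=30, width=10))  # falls in the class starting at 30
print(class_of(40, start=30, width=10))  # boundary value starts a new class
```

Because every observation maps to exactly one class, guideline (2) above (each observation in one and only one class) holds by construction.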
Example: Consider the marks obtained by 50 students in a particular course, which are as follows: 68, 79, 38, 68, 35, 70, 61, 47, 58, 66, 60, 45, 61, 60, 59, 45, 39, 80, 59, 62, 49, 76, 54, 60, 53, 55, 62, 58, 67, 55, 86, 56, 63, 64, 67, 50, 51, 78, 56, 62, 57, 54, 69, 58, 52, 42, 66, 42, 56, 58.

The frequency table for the above dataset is:

| Class interval | Frequency | Relative frequency |
|---|---|---|
| 30 − 40 | 3 | 0.06 |
| 40 − 50 | 6 | 0.12 |
| 50 − 60 | 18 | 0.36 |
| 60 − 70 | 17 | 0.34 |
| 70 − 80 | 4 | 0.08 |
| 80 − 90 | 2 | 0.04 |
| Total | 50 | 1 |

[Figure: histogram of the marks by class interval]

4.3 Stem-and-leaf diagram

In a stem-and-leaf diagram (or stemplot), each observation is separated into two parts, namely, a stem, consisting of all but the rightmost digit, and a leaf, the rightmost digit.

For example: If the data are all two-digit numbers, then we could let the stem of a data value be the tens digit and the leaf be the ones digit. The value 75 is expressed as:

Stem | Leaf
7 | 5

Here, 7 | 5 represents 75. The two values 75 and 78 are expressed as:

Stem | Leaf
7 | 5 8

Here, 7 | 5 represents 75.

4.3.1 Steps to construct a stemplot

(1) Think of each observation as a stem, consisting of all but the rightmost digit, and a leaf, the rightmost digit.

(2) Write the stems from smallest to largest in a vertical column to the left of a vertical rule.

(3) Write each leaf to the right of the vertical rule in the row that contains the appropriate stem.

(4) Arrange the leaves in each row in ascending order.

Example: Draw a stem-and-leaf plot for the dataset 15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 48, which are the ages of 11 patients admitted to a certain hospital.

The stem-and-leaf plot for the above dataset is:

Stem | Leaf
1 | 0 5
2 | 2 3 5 8 9
3 | 1 6
4 | 5 8

Here, 1 | 0 represents 10 years.

4.4 Descriptive Measures

Descriptive measures are quantities whose values are determined by the data and can be used to summarize a data set.
Types of Descriptive Measures

The most commonly used descriptive measures can be categorized as:

Measures of central tendency: These are measures that indicate the most typical value or center of a data set.

Measures of dispersion: These measures indicate the variability or spread of a dataset.

4.4.1 Measures of Central Tendency

4.4.1.1 Mean

The mean of a data set is the sum of the observations divided by the number of observations. The mean is the most commonly used measure of central tendency and is usually referred to as the average (arithmetic average): it divides the sum of the values by the number of values to give a single typical value.

Mean formula for discrete observations:

(1) Sample mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

(2) Population mean: $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N}$

Example (1):

(a) The mean of the observations 2, 12, 5, 7, 6, 7, 3 is
$$\bar{x} = \frac{2 + 12 + 5 + 7 + 6 + 7 + 3}{7} = 6$$

(b) The mean of the observations 2, 105, 5, 7, 6, 7, 3 is
$$\bar{x} = \frac{2 + 105 + 5 + 7 + 6 + 7 + 3}{7} = 19.29$$

(c) The mean of the observations 2, 105, 5, 7, 6, 3 is
$$\bar{x} = \frac{2 + 105 + 5 + 7 + 6 + 3}{6} = 21.33$$

Example (2): Suppose the marks obtained by ten students in an exam are 68, 79, 38, 68, 35, 70, 61, 47, 58, 66. The sample mean is:
$$\bar{x} = \frac{68 + 79 + 38 + 68 + 35 + 70 + 61 + 47 + 58 + 66}{10} = \frac{590}{10} = 59$$

Mean for grouped data: discrete single value data

The mean formula for grouped data in the case of discrete single value data is:
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$$
where the denominator, the sum of the frequencies, is the total number of observations.

Example (3): Consider the dataset 2, 1, 3, 4, 5, 2, 3, 3, 3, 4, 4, 1, 2, 3, 4, which are responses from 15 individuals. The frequency table for this dataset is:

| Value ($x_i$) | Frequency ($f_i$) | $f_i x_i$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 3 | 6 |
| 3 | 5 | 15 |
| 4 | 4 | 16 |
| 5 | 1 | 5 |
| Total | 15 | 44 |

$$\text{Mean} = \frac{44}{15} = 2.93$$

Mean for grouped data: continuous data

The mean formula for grouped data in the case of continuous data is:
$$\bar{x} = \frac{f_1 m_1 + f_2 m_2 + \cdots + f_n m_n}{f_1 + f_2 + \cdots + f_n}$$
where $m_i$, $i = 1, 2, \ldots$
, $n$, is the mid-point of the $i$th class interval.

Example:

| Class interval | Frequency ($f_i$) | Mid-point ($m_i$) | $f_i m_i$ |
|---|---|---|---|
| 30 − 40 | 3 | 35 | 105 |
| 40 − 50 | 6 | 45 | 270 |
| 50 − 60 | 18 | 55 | 990 |
| 60 − 70 | 17 | 65 | 1105 |
| 70 − 80 | 4 | 75 | 300 |
| 80 − 90 | 2 | 85 | 170 |
| Total | 50 | | 2940 |

By applying the formula, average $= \dfrac{2940}{50} = 58.8$, which is an approximate, not exact, value of the mean.

Adding a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and the mean of the dataset is $\bar{x}$. Let $y_i = x_i + c$, where $c$ is a constant; then $\bar{y} = \bar{x} + c$.

Proof:
$$\begin{aligned}
\bar{y} &= \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \cdots + y_n}{n} \\
&= \frac{(x_1 + c) + (x_2 + c) + \cdots + (x_n + c)}{n} \\
&= \frac{(x_1 + x_2 + \cdots + x_n) + nc}{n} \\
&= \frac{x_1 + x_2 + \cdots + x_n}{n} + \frac{nc}{n} \\
&= \bar{x} + c
\end{aligned}$$

Example: Suppose the marks obtained by 10 students in an exam are 68, 79, 38, 68, 35, 70, 61, 47, 58, 66 and the average mark is 59. If the teacher decides to add 5 marks to each student, find the mean of the new dataset.

Solution: By the property of adding a constant, the mean of the new dataset is $\bar{y} = \bar{x} + c = 59 + 5 = 64$.

We can also verify this directly. The marks of the 10 students are 68, 79, 38, 68, 35, 70, 61, 47, 58, 66, and after adding 5 to each observation the new dataset is 73, 84, 43, 73, 40, 75, 66, 52, 63, 71. Therefore the mean of the new dataset is:
$$\bar{y} = \frac{73 + 84 + 43 + 73 + 40 + 75 + 66 + 52 + 63 + 71}{10} = \frac{640}{10} = 64 = 59 + 5$$

Multiplying a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and the mean of the dataset is $\bar{x}$. Let $y_i = x_i c$, where $c$ is a constant; then $\bar{y} = \bar{x} c$.

Proof:
$$\begin{aligned}
\bar{y} &= \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \cdots + y_n}{n} \\
&= \frac{(x_1 c) + (x_2 c) + \cdots + (x_n c)}{n} \\
&= \frac{(x_1 + x_2 + \cdots + x_n)\, c}{n} \\
&= \bar{x} c
\end{aligned}$$

Example: Suppose the marks obtained by 10 students in an exam are 68, 79, 38, 68, 35, 70, 61, 47, 58, 66 and the average mark is 59. If the teacher decides to scale each mark down to 40% of its value, i.e., each mark is multiplied by 0.4, find the mean of the new dataset.

Solution: By the property of multiplying by a constant, the mean of the new dataset is $\bar{y} = \bar{x} c = 59 \times 0.4 = 23.6$.
We can also verify this directly. The marks of the 10 students are 68, 79, 38, 68, 35, 70, 61, 47, 58, 66, and after multiplying each observation by 0.4 the new dataset is 27.2, 31.6, 15.2, 27.2, 14, 28, 24.4, 18.8, 23.2, 26.4. Therefore the mean of the new dataset is:
$$\bar{y} = \frac{27.2 + 31.6 + 15.2 + 27.2 + 14 + 28 + 24.4 + 18.8 + 23.2 + 26.4}{10} = \frac{236}{10} = 23.6 = 59 \times 0.4$$

4.4.1.2 Median

The median of a data set is the middle value in its ordered list. In other words, the median of a data set is the number that divides the bottom 50% of the data from the top 50%.

Steps to obtain median

Arrange the data in increasing order. Let $n$ be the total number of observations in the dataset.

(1) If the number of observations is odd, then the median is the observation exactly in the middle of the ordered list, i.e., the $\left(\dfrac{n+1}{2}\right)^{th}$ observation.

(2) If the number of observations is even, then the median is the mean of the two middle observations in the ordered list, i.e., the mean of the $\left(\dfrac{n}{2}\right)^{th}$ and $\left(\dfrac{n}{2} + 1\right)^{th}$ observations.

Examples:

(1) Compute the median of the dataset 2, 12, 5, 7, 6, 7, 3.
Step (1): Arrange the data in increasing order: 2, 3, 5, 6, 7, 7, 12.
Step (2): Here, $n = 7$, which is odd. So the median is the $\left(\dfrac{n+1}{2}\right)^{th} = \left(\dfrac{8}{2}\right)^{th} = 4$th observation. Thus, the median is 6.

(2) Compute the median of the dataset 2, 105, 5, 7, 6, 7, 3.
Step (1): The data in increasing order is: 2, 3, 5, 6, 7, 7, 105.
Step (2): Here, $n = 7$, which is odd. So the median is the $\left(\dfrac{7+1}{2}\right)^{th} = 4$th observation, which is 6.

(3) Compute the median of the dataset 2, 105, 5, 7, 6, 3.
Step (1): The data in increasing order is: 2, 3, 5, 6, 7, 105.
Step (2): Here, $n = 6$, which is even. So the median is the average of the $\left(\dfrac{6}{2}\right)^{th} = 3$rd and $\left(\dfrac{6}{2} + 1\right)^{th} = 4$th observations. Thus, the median is $\dfrac{5 + 6}{2} = 5.5$.

Adding a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i + c$, where $c$ is a constant; then new median = old median + $c$.
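The shift and scale properties of the mean, and the shift property of the median, can be checked numerically with Python's standard `statistics` module. This is a minimal sketch using the ten marks from the examples; the helper name `transformed` is hypothetical, and `math.isclose` is used for the scaled data because multiplying by 0.4 introduces floating-point rounding.

```python
import math
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

def transformed(values, shift=0, scale=1):
    """Return y_i = scale * x_i + shift for every observation."""
    return [scale * x + shift for x in values]

mean = statistics.mean(marks)
median = statistics.median(marks)

shifted = transformed(marks, shift=5)    # add 5 marks to every student
scaled = transformed(marks, scale=0.4)   # multiply every mark by 0.4

# Adding a constant shifts the mean and the median by that constant.
print(statistics.mean(shifted), mean + 5)
print(statistics.median(shifted), median + 5)

# Multiplying by a constant multiplies the mean by that constant.
print(math.isclose(statistics.mean(scaled), 0.4 * mean))
```

Comparing floats with `math.isclose` rather than `==` avoids spurious mismatches such as 27.200000000000003 versus 27.2.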
Example: Consider again the marks obtained by 10 students in an exam: 68, 79, 38, 68, 35, 70, 61, 47, 58, 66. If the teacher decides to add 5 marks to each student, find the median of the new dataset.

Solution: First, arrange the data in ascending order: 35, 38, 47, 58, 61, 66, 68, 68, 70, 79. Here, $n = 10$, so the median of the dataset is $\dfrac{61 + 66}{2} = 63.5$.

After adding 5 to each observation, the new dataset in ascending order is 40, 43, 52, 63, 66, 71, 73, 73, 75, 84. The median of the new dataset is $\dfrac{66 + 71}{2} = 68.5 = 63.5 + 5$ = old median + 5.

Multiplying a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i c$, where $c$ is a constant; then new median = old median × $c$.

Example: Consider again the marks obtained by 10 students in an exam: 68, 79, 38, 68, 35, 70, 61, 47, 58, 66. If the teacher decides to scale each mark down to 40% of its value, i.e., each mark is multiplied by 0.4, find the median of the new dataset.

Solution: From the previous example, the median of this dataset is 63.5. After multiplying each observation by 0.4, the new dataset is 27.2, 31.6, 15.2, 27.2, 14, 28, 24.4, 18.8, 23.2, 26.4. In ascending order this is 14, 15.2, 18.8, 23.2, 24.4, 26.4, 27.2, 27.2, 28, 31.6. The median of the new dataset is $\dfrac{24.4 + 26.4}{2} = 25.4 = 0.4 \times 63.5$ = old median × 0.4.

Note: “Mean is sensitive to outliers, whereas the median is not sensitive to outliers.”

Example:

(1) For the dataset 2, 12, 5, 7, 6, 7, 3:
$$\text{Mean} = \frac{2 + 3 + 5 + 6 + 7 + 7 + 12}{7} = 6$$
Arranging the data in ascending order gives 2, 3, 5, 6, 7, 7, 12, so Median = 6.

(2) For the dataset 2, 117, 5, 7, 6, 7, 3:
$$\text{Mean} = \frac{2 + 3 + 5 + 6 + 7 + 7 + 117}{7} = 21$$
Arranging the data in ascending order gives 2, 3, 5, 6, 7, 7, 117, so Median = 6. The single outlier 117 moves the mean from 6 to 21 but leaves the median unchanged.

4.4.1.3 Mode

The mode of a dataset is its most frequently occurring value.

Examples:

(1) Find the mode of the dataset 2, 12, 5, 7, 6, 7, 3. The mode is 7 as it occurs twice.

(2) Find the mode of the dataset 2, 105, 5, 7, 6, 7, 3. The mode is 7.
(3) Find the mode of the dataset 2, 105, 5, 7, 6, 3. There is no mode for this dataset as no value occurs more than once.

Adding a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i + c$, where $c$ is a constant; then new mode = old mode + $c$.

Example: Consider the marks obtained by 10 students in an exam: 68, 79, 38, 68, 35, 70, 61, 47, 58, 66. If the teacher decides to add 5 marks to each student, find the mode of the new dataset.

Solution: The mode of the old dataset is 68. After adding 5 to each observation, the new dataset is 73, 84, 43, 73, 40, 75, 66, 52, 63, 71. The mode of the new dataset is 73 = 68 + 5 = old mode + 5.

Multiplying a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i c$, where $c$ is a constant; then new mode = old mode × $c$.

Example: Consider the marks obtained by 10 students in an exam: 68, 79, 38, 68, 35, 70, 61, 47, 58, 66. If the teacher decides to scale each mark down to 40% of its value, i.e., each mark is multiplied by 0.4, find the mode of the new dataset.

Solution: The mode of the old dataset is 68. After multiplying each observation by 0.4, the new dataset is 27.2, 31.6, 15.2, 27.2, 14, 28, 24.4, 18.8, 23.2, 26.4. The mode of the new dataset is 27.2 = 0.4 × 68 = old mode × 0.4.

4.4.2 Measures of Dispersion

A measure of dispersion indicates the amount of variation, or spread, in a dataset. These measures are also known as measures of variation, or measures of spread. Some measures of dispersion are:

(1) Range
(2) Variance
(3) Standard Deviation
(4) Interquartile range

4.4.2.1 Range

The range of a dataset is the difference between its largest and smallest values. The range of a dataset is given by the formula:

Range = Max − Min

where Max and Min represent the maximum and minimum values of the dataset respectively.

Examples:

(1) Find the range of the dataset 3, 3, 3, 3, 3.
Solution: Here, the maximum value of the dataset is 3 and the minimum value is also 3.
Therefore, Range = Max − Min = 3 − 3 = 0.

(2) Find the range of the dataset 1, 2, 3, 4, 5.
Solution: Here, the maximum value of the dataset is 5 and the minimum value is 1. Therefore, Range = Max − Min = 5 − 1 = 4.

Effect of outliers on Range

The range is sensitive to outliers because it takes into consideration only the minimum and maximum values of the dataset. For example:

(1) Dataset 1: 1, 2, 3, 4, 5. Range of dataset 1 = 5 − 1 = 4.

(2) Dataset 2: 1, 2, 3, 4, 15. Range of dataset 2 = 15 − 1 = 14.

The two datasets differ in only one point, and this point changes the value of the range significantly. This happens because the range depends only on the maximum and minimum values of the dataset.

4.4.2.2 Variance

Variance measures the variability of a data set and considers the deviations of the data values from the central value. The range is also a measure of dispersion, but it takes into account only the minimum and maximum values of the dataset, whereas the variance takes into account all the observations.

Population variance and sample variance

The two variances can be computed using the following formulae:

$$\text{Population variance: } \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

$$\text{Sample variance: } s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Example: Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66.

(1) Compute the population variance of the dataset.

Solution:

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 68 | 9 | 81 |
| 79 | 20 | 400 |
| 38 | −21 | 441 |
| 68 | 9 | 81 |
| 35 | −24 | 576 |
| 70 | 11 | 121 |
| 61 | 2 | 4 |
| 47 | −12 | 144 |
| 58 | −1 | 1 |
| 66 | 7 | 49 |
| $\sum x_i = 590$ | | $\sum (x_i - \bar{x})^2 = 1898$ |

$$\text{Mean: } \bar{x} = \frac{590}{10} = 59$$

$$\text{Population variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{1898}{10} = 189.8$$

(2) Compute the sample variance of the dataset.

$$\text{Sample variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1898}{9} = 210.89$$

Adding a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i + c$, where $c$ is a constant; then new variance = old variance.

Proof: Let the population variance of the old dataset $x_1, x_2, \ldots, x_n$ be $\sigma_{\text{old}}^2$ and that of the new dataset be $\sigma_{\text{new}}^2$.
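Now,
$$\sigma_{\text{new}}^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}$$

On substituting $y_i = x_i + c$ and $\bar{y} = \bar{x} + c$ in the above equation, we get
$$\begin{aligned}
\sigma_{\text{new}}^2 &= \frac{\sum_{i=1}^{n} (x_i + c - (\bar{x} + c))^2}{n} \\
&= \frac{\sum_{i=1}^{n} (x_i + c - \bar{x} - c)^2}{n} \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \\
&= \sigma_{\text{old}}^2
\end{aligned}$$

Hence, there is no change in the variance when a constant is added to each observation of the old dataset, i.e., new variance = old variance.

For example: Consider the dataset in example 1 of population variance. If we add 4 to each observation, the population variance of the new dataset is computed as follows:

| $x_i$ | $y_i = x_i + 4$ | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|
| 68 | 72 | 9 | 81 |
| 79 | 83 | 20 | 400 |
| 38 | 42 | −21 | 441 |
| 68 | 72 | 9 | 81 |
| 35 | 39 | −24 | 576 |
| 70 | 74 | 11 | 121 |
| 61 | 65 | 2 | 4 |
| 47 | 51 | −12 | 144 |
| 58 | 62 | −1 | 1 |
| 66 | 70 | 7 | 49 |
| $\sum x_i = 590$ | $\sum y_i = 630$ | | $\sum (y_i - \bar{y})^2 = 1898$ |

$$\text{Mean: } \bar{y} = \frac{630}{10} = 63$$

$$\text{Population variance} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n} = \frac{1898}{10} = 189.8,$$
which is the same as the variance of the old dataset.

Multiplying a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i \times c$, where $c$ is a constant; then new variance = $c^2$ × old variance.

Proof: Let the population variance of the old dataset $x_1, x_2, \ldots, x_n$ be $\sigma_{\text{old}}^2$ and that of the new dataset be $\sigma_{\text{new}}^2$. Now,
$$\sigma_{\text{new}}^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}$$

On substituting $y_i = x_i \times c$ and $\bar{y} = \bar{x} \times c$ in the above equation, we get
$$\begin{aligned}
\sigma_{\text{new}}^2 &= \frac{\sum_{i=1}^{n} (c x_i - c \bar{x})^2}{n} \\
&= \frac{\sum_{i=1}^{n} (c (x_i - \bar{x}))^2}{n} \\
&= \frac{c^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \\
&= c^2 \times \sigma_{\text{old}}^2
\end{aligned}$$

For example: Consider the dataset in example 1 of population variance. If we multiply each observation by 0.5, the population variance of the new dataset is computed as follows:

| $x_i$ | $y_i = 0.5 \times x_i$ | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|
| 68 | 34 | 4.5 | 20.25 |
| 79 | 39.5 | 10 | 100 |
| 38 | 19 | −10.5 | 110.25 |
| 68 | 34 | 4.5 | 20.25 |
| 35 | 17.5 | −12 | 144 |
| 70 | 35 | 5.5 | 30.25 |
| 61 | 30.5 | 1 | 1 |
| 47 | 23.5 | −6 | 36 |
| 58 | 29 | −0.5 | 0.25 |
| 66 | 33 | 3.5 | 12.25 |
| $\sum x_i = 590$ | $\sum y_i = 295$ | | $\sum (y_i - \bar{y})^2 = 474.5$ |

$$\text{Mean: } \bar{y} = \frac{295}{10} = 29.5$$

$$\text{Population variance} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n} = \frac{474.5}{10} = 47.45 = 0.5^2 \times 189.8$$

4.4.2.3 Standard Deviation

Standard deviation is also a measure of dispersion, and it is the square root of the variance.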
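The variance results above, and the standard deviation as its square root, can be reproduced with Python's standard `statistics` module. This is a minimal sketch on the same ten marks: `pvariance` divides by n and `variance` divides by n − 1, and `math.isclose` is used where floating-point rounding could otherwise interfere.

```python
import math
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

pop_var = statistics.pvariance(marks)      # population variance, divides by n
sample_var = statistics.variance(marks)    # sample variance, divides by n - 1
pop_sd = math.sqrt(pop_var)                # standard deviation = sqrt(variance)

shifted = [x + 4 for x in marks]           # adding a constant
scaled = [0.5 * x for x in marks]          # multiplying by a constant

print(round(pop_var, 2), round(sample_var, 2), round(pop_sd, 2))

# Adding a constant leaves the variance unchanged.
print(math.isclose(statistics.pvariance(shifted), pop_var))

# Multiplying by c multiplies the variance by c squared.
print(math.isclose(statistics.pvariance(scaled), 0.5 ** 2 * pop_var))
```

The same check works for the sample versions by swapping in `statistics.variance` and `statistics.stdev`.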
Formulas of standard deviation

The population standard deviation and sample standard deviation can be computed by using the following formulae:

$$\text{Population standard deviation: } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

$$\text{Sample standard deviation: } s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Examples:

(1) Consider the dataset in example 1 of variance. The population standard deviation can be computed as follows:

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 68 | 9 | 81 |
| 79 | 20 | 400 |
| 38 | −21 | 441 |
| 68 | 9 | 81 |
| 35 | −24 | 576 |
| 70 | 11 | 121 |
| 61 | 2 | 4 |
| 47 | −12 | 144 |
| 58 | −1 | 1 |
| 66 | 7 | 49 |
| $\sum x_i = 590$ | | $\sum (x_i - \bar{x})^2 = 1898$ |

$$\text{Mean: } \bar{x} = \frac{590}{10} = 59$$

$$\text{Population variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{1898}{10} = 189.8$$

So, population standard deviation $= \sqrt{189.8} = 13.78$.

(2) In example 2 of variance, the sample variance is computed as 210.89. So the sample standard deviation is $\sqrt{210.89} = 14.52$.

Units of standard deviation

Variance is expressed in the square of the units of the original variable, while standard deviation is expressed in the same units as the original data.

For example:

(1) If we have a dataset of the weights of 10 students measured in kg, then the unit of variance is (kg)² and the unit of standard deviation is kg.

(2) If we have a dataset of the ages of 10 students measured in years, then the unit of variance is (year)² and the unit of standard deviation is year.

Adding a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i + c$, where $c$ is a constant; then new standard deviation = old standard deviation.

Proof: Let the population standard deviation of the old dataset $x_1, x_2, \ldots, x_n$ be $\sigma_{\text{old}}$ and that of the new dataset be $\sigma_{\text{new}}$. Now,
$$\sigma_{\text{new}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}$$

On substituting $y_i = x_i + c$ and $\bar{y} = \bar{x} + c$ in the above equation, we get
$$\begin{aligned}
\sigma_{\text{new}} &= \sqrt{\frac{\sum_{i=1}^{n} (x_i + c - (\bar{x} + c))^2}{n}} \\
&= \sqrt{\frac{\sum_{i=1}^{n} (x_i + c - \bar{x} - c)^2}{n}} \\
&= \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \\
&= \sigma_{\text{old}}
\end{aligned}$$

Similarly, we can prove the result for the sample standard deviation.
Hence, there is no change in the standard deviation when a constant is added to each observation of the old dataset, i.e., new standard deviation = old standard deviation.

For example: Consider the dataset in example 1 of population standard deviation. If we add 4 to each observation, the population standard deviation of the new dataset is computed as follows:

| $x_i$ | $y_i = x_i + 4$ | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|
| 68 | 72 | 9 | 81 |
| 79 | 83 | 20 | 400 |
| 38 | 42 | −21 | 441 |
| 68 | 72 | 9 | 81 |
| 35 | 39 | −24 | 576 |
| 70 | 74 | 11 | 121 |
| 61 | 65 | 2 | 4 |
| 47 | 51 | −12 | 144 |
| 58 | 62 | −1 | 1 |
| 66 | 70 | 7 | 49 |
| $\sum x_i = 590$ | $\sum y_i = 630$ | | $\sum (y_i - \bar{y})^2 = 1898$ |

$$\text{Mean: } \bar{y} = \frac{630}{10} = 63$$

$$\text{Population standard deviation} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}} = \sqrt{\frac{1898}{10}} = \sqrt{189.8} = 13.78,$$
which is the same as the population standard deviation of the old dataset.

Multiplying a constant

Suppose $x_1, x_2, \ldots, x_n$ are observations of a dataset and let $y_i = x_i \times c$, where $c$ is a constant; then new standard deviation = $|c|$ × old standard deviation. For a non-negative constant this is simply $c$ × old standard deviation.

Proof: Let the population standard deviation of the old dataset $x_1, x_2, \ldots, x_n$ be $\sigma_{\text{old}}$ and that of the new dataset be $\sigma_{\text{new}}$. Now,
$$\sigma_{\text{new}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}$$

On substituting $y_i = x_i \times c$ and $\bar{y} = \bar{x} \times c$ in the above equation, we get
$$\begin{aligned}
\sigma_{\text{new}} &= \sqrt{\frac{\sum_{i=1}^{n} (c x_i - c \bar{x})^2}{n}} \\
&= \sqrt{\frac{\sum_{i=1}^{n} (c (x_i - \bar{x}))^2}{n}} \\
&= \sqrt{\frac{c^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \\
&= |c| \times \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \\
&= |c| \times \sigma_{\text{old}}
\end{aligned}$$

For example: Consider the dataset in example 1 of population standard deviation. If we multiply each observation by 0.5, the population standard deviation of the new dataset is computed as follows:

| $x_i$ | $y_i = 0.5 \times x_i$ | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|
| 68 | 34 | 4.5 | 20.25 |
| 79 | 39.5 | 10 | 100 |
| 38 | 19 | −10.5 | 110.25 |
| 68 | 34 | 4.5 | 20.25 |
| 35 | 17.5 | −12 | 144 |
| 70 | 35 | 5.5 | 30.25 |
| 61 | 30.5 | 1 | 1 |
| 47 | 23.5 | −6 | 36 |
| 58 | 29 | −0.5 | 0.25 |
| 66 | 33 | 3.5 | 12.25 |
| $\sum x_i = 590$ | $\sum y_i = 295$ | | $\sum (y_i - \bar{y})^2 = 474.5$ |

$$\text{Mean: } \bar{y} = \frac{295}{10} = 29.5$$

$$\text{Population standard deviation} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}} = \sqrt{47.45} = 6.89 = 0.5 \times 13.78$$

4.5 Percentiles

The sample 100p percentile is that data value having the property that at least 100p percent of the data are less than or equal to it and at least 100(1 − p) percent of the data values are greater than or equal to it.

[Figure: ordered data with the percentile positions marked]

In the figure, P99, the 99th percentile, has at least 100 × 0.99 = 99% of the data less than or equal to it and at least 1% greater than or equal to it. Similarly, P1 has at least 1% of the data less than or equal to it and at least 99% greater than or equal to it. Thus, the 100p percentile is the value in the dataset such that 100 × p percent of the data are less than or equal to it and 100 × (1 − p) percent are greater than or equal to it. If two data values satisfy this condition, then the sample 100p percentile is the arithmetic average of these values.

4.5.1 Computing Percentiles

To find the sample 100p percentile of a data set of size n, we follow these steps:

(1) Arrange the data in increasing order.

(2) If np is not an integer, determine the smallest integer greater than np. The data value in that position is the sample 100p percentile.

(3) If np is an integer, then the average of the values in positions np and np + 1 is the sample 100p percentile.

Examples: Consider the dataset 68, 38, 66, 79, 61, 47, 68, 35, 70, 58.

(1) Compute the 25th percentile of the dataset.
Solution: First, arrange the data in ascending order: 35, 38, 47, 58, 61, 66, 68, 68, 70, 79. Here, n = 10 and p = 0.25, so np = 10 × 0.25 = 2.5. Since np is not an integer, we take the smallest integer greater than 2.5, which is 3. So, the 3rd observation of the ordered dataset is the 25th percentile, which is 47.

(2) Compute the 75th percentile of the dataset.
Solution: First, arrange the data in ascending order: 35, 38, 47, 58, 61, 66, 68, 68, 70, 79. Here, n = 10 and p = 0.75, so np = 10 × 0.75 = 7.5. Since np is not an integer, we take the smallest integer greater than 7.5, which is 8.
So, the 8th observation of the ordered dataset is the 75th percentile, which is 68.

(3) Compute the 10th percentile of the dataset.
Solution: First, arrange the data in ascending order: 35, 38, 47, 58, 61, 66, 68, 68, 70, 79. Here, n = 10 and p = 0.10, so np = 10 × 0.10 = 1. Since np is an integer, we take the average of the 1st and 2nd observations. So, the 10th percentile of the dataset is $\dfrac{35 + 38}{2} = 36.5$.

(4) Compute the 50th percentile of the dataset.
Solution: First, arrange the data in ascending order: 35, 38, 47, 58, 61, 66, 68, 68, 70, 79. Here, n = 10 and p = 0.50, so np = 10 × 0.50 = 5. Since np is an integer, we take the average of the 5th and 6th observations. So, the 50th percentile of the dataset is $\dfrac{61 + 66}{2} = 63.5$.

4.6 Quartiles

Quartiles are the three values that divide a given dataset into four parts. The sample 25th percentile is called the first quartile, the sample 50th percentile is called the median or the second quartile, and the sample 75th percentile is called the third quartile.

In other words, the quartiles break up a data set into four parts, with about 25 percent of the data values being less than the first (lower) quartile, about 25 percent being between the first and second quartiles, about 25 percent being between the second and third (upper) quartiles, and about 25 percent being larger than the third quartile.

Q1 represents the first quartile, Q2 the second quartile and Q3 the third quartile of the dataset.

For example:

(1) In example 1 of percentiles, the 25th percentile is 47, which is Q1 (first quartile).

(2) In example 4 of percentiles, the 50th percentile is 63.5, which is Q2 (second quartile).

(3) In example 2 of percentiles, the 75th percentile is 68, which is Q3 (third quartile).

4.7 Five Number Summary

The five number summary is a set of five descriptive statistics that together give a compact summary of a dataset.
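The percentile rule of Section 4.5.1 can be sketched in Python as shown here. This is an illustration under stated assumptions, not a library routine: the function names `percentile` and `five_number_summary` are hypothetical, and the percentile is passed as an integer k (so p = k/100) so that the test "is np an integer?" can be done with exact integer arithmetic instead of floating point.

```python
def percentile(values, k):
    """Sample k-th percentile (k an integer with 0 < k < 100) by the np rule."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(n * k, 100)        # np = whole + remainder / 100
    if remainder:                                # np is not an integer:
        return data[whole]                       # 1-based position whole + 1
    return (data[whole - 1] + data[whole]) / 2   # average of positions np, np + 1

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the percentile rule above."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))

marks = [68, 38, 66, 79, 61, 47, 68, 35, 70, 58]
for k in (25, 75, 10, 50):
    print(k, percentile(marks, k))

print(five_number_summary(marks))
```

Note that statistical packages implement several different percentile conventions, so their results can differ slightly from this rule on small datasets.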
The five number summary consists of:

Minimum
Q1: First Quartile
Q2: Second Quartile or Median
Q3: Third Quartile
Maximum

For example: Find the five number summary of the dataset 18, 28, 16, 29, 11, 27, 26, 35, 37, 28.

Solution: First, arrange the data in ascending order: 11, 16, 18, 26, 27, 28, 28, 29, 35, 37.

The minimum value of the dataset is 11.

With n = 10 and p = 0.25, np = 10 × 0.25 = 2.5. Thus, the 3rd observation of the ordered dataset is the value of Q1, which is 18.

Next, np = 10 × 0.50 = 5. Thus, the average of the 5th and 6th observations is the value of Q2, which is $\dfrac{27 + 28}{2} = 27.5$.

Next, np = 10 × 0.75 = 7.5. Thus, the 8th observation of the ordered dataset is the value of Q3, which is 29.

The maximum value of the dataset is 37.

Hence, the five number summary of the dataset is 11, 18, 27.5, 29, 37.

4.8 Interquartile Range (IQR)

The interquartile range, IQR, is the difference between the third and first quartiles.

IQR = Q3 − Q1

For example: In example 1 of percentiles, the 25th percentile is 47, which is Q1 (first quartile), and in example 2 of percentiles, the 75th percentile is 68, which is Q3 (third quartile). Thus, the IQR of the dataset is Q3 − Q1 = 68 − 47 = 21.

Chapter 5

5 Association between two variables

Association between two variables means that knowing information about one variable provides information about the other variable.

5.1 Association Between Two Categorical Variables

To find the association between two categorical variables, we first make a contingency table and then apply the following criteria.

If the row relative frequencies (the column relative frequencies) are the same for all rows (columns), then we say that the two variables are not associated with each other.

If the row relative frequencies (the column relative frequencies) are different for some rows (some columns), then we say that the two variables are associated with each other.
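The two criteria above compare relative frequencies across rows (or columns) of a contingency table. A minimal Python sketch of that computation is given here, using the smartphone-ownership counts by gender (Table 5.1) and by income level (Table 5.6) from this chapter; the function names and the nested-dictionary layout are assumptions chosen for the illustration.

```python
def row_relative_frequencies(table):
    """Divide each cell by its row total; percentages rounded to 2 decimals."""
    return {
        row: {col: round(100 * count / sum(cells.values()), 2)
              for col, count in cells.items()}
        for row, cells in table.items()
    }

def column_relative_frequencies(table):
    """Divide each cell by its column total, by transposing and reusing rows."""
    transposed = {}
    for row, cells in table.items():
        for col, count in cells.items():
            transposed.setdefault(col, {})[row] = count
    return row_relative_frequencies(transposed)

by_gender = {"Female": {"No": 10, "Yes": 34}, "Male": {"No": 14, "Yes": 42}}
by_income = {"High": {"No": 2, "Yes": 18},
             "Medium": {"No": 27, "Yes": 39},
             "Low": {"No": 9, "Yes": 5}}

print(row_relative_frequencies(by_gender))     # rows look alike
print(row_relative_frequencies(by_income))     # rows differ clearly
print(column_relative_frequencies(by_gender))
```

In real survey data the rows are rarely identical, so in practice "the same" is read as "approximately the same"; how close is close enough is a judgment call that this sketch leaves to the reader.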
Note: To determine the association between two categorical variables from the contingency table, we need to calculate either the row relative frequencies or the column relative frequencies.

Examples:

(1) A market research firm is interested in finding out whether ownership of a smartphone is associated with the gender of a student. For this, a group of 100 college-going students were surveyed about whether they owned a smartphone or not, and the following information was received.

(i) There are 44 female and 56 male students.
(ii) 76 students owned a smartphone and 24 did not.
(iii) 34 female students and 42 male students owned a smartphone.

The given data can be organized in a contingency table as follows:

Table 5.1

| Gender | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| Female | 10 | 34 | 44 |
| Male | 14 | 42 | 56 |
| Column Total | 24 | 76 | 100 |

We can find the table of row relative frequencies by dividing each cell frequency in a row by its row total:

Table 5.2

| Gender | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| Female | 10/44 | 34/44 | 44/44 |
| Male | 14/56 | 42/56 | 56/56 |
| Column Total | 24/100 | 76/100 | 100/100 |

Table 5.3

| Gender | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| Female | 22.73% | 77.27% | 44 |
| Male | 25.00% | 75.00% | 56 |
| Column Total | 24.00% | 76.00% | 100 |

In the above table, we can observe that the row relative frequencies are approximately the same for all the rows (about 23% to 25% “No” and 75% to 77% “Yes”). So, we can say that the two categorical variables, i.e., gender and smartphone ownership, are not associated with each other.

We can also find the association between two categorical variables using the column relative frequencies, which are computed by dividing each cell frequency in a column by its column total.
The column relative frequencies for the dataset in Table 5.1 are as follows:

Table 5.4

| Gender | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| Female | 10/24 | 34/76 | 44/100 |
| Male | 14/24 | 42/76 | 56/100 |
| Column Total | 24 | 76 | 100 |

Table 5.5

| Gender | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| Female | 41.67% | 44.74% | 44.00% |
| Male | 58.33% | 55.26% | 56.00% |
| Column Total | 24 | 76 | 100 |

In the above table, we can observe that the column relative frequencies are approximately the same for all the columns. So, we can say that the two categorical variables, i.e., gender and smartphone ownership, are not associated with each other.

Hence, we can observe from Table 5.3 and Table 5.5 that if the row relative frequencies (the column relative frequencies) are the same for all rows (columns), then the two variables are not associated with each other.

(2) An analyst is interested in finding out whether ownership of a smartphone is associated with the income of an individual. For this, a group of 100 randomly picked individuals were surveyed about whether they owned a smartphone or not, and the following information was received.

(i) There are 20 high income, 66 medium income and 14 low income individuals.
(ii) 62 individuals owned a smartphone and 38 did not.
(iii) 18 high income individuals, 39 medium income individuals, and 5 low income individuals owned a smartphone.

The given data can be organized in a contingency table as follows:

Table 5.6

| Income Level | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| High | 2 | 18 | 20 |
| Medium | 27 | 39 | 66 |
| Low | 9 | 5 | 14 |
| Column Total | 38 | 62 | 100 |

The row relative frequency table for the dataset in Table 5.6 is as follows:

Table 5.7

| Income Level | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| High | 2/20 | 18/20 | 20/20 |
| Medium | 27/66 | 39/66 | 66/66 |
| Low | 9/14 | 5/14 | 14/14 |
| Column Total | 38/100 | 62/100 | 100/100 |

Table 5.8

| Income Level | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| High | 10.00% | 90.00% | 20 |
| Medium | 40.91% | 59.09% | 66 |
| Low | 64.29% | 35.71% | 14 |
| Column Total | 38.00% | 62.00% | 100 |

Here, the row relative frequencies are different for some rows.
Hence, we can say that the two categorical variables, i.e., income level and smartphone ownership, are associated with each other.

Similarly, we can find the association between “Income level” and “smartphone ownership” by computing the column relative frequencies as follows:

Table 5.9

| Income Level | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| High | 2/38 | 18/62 | 20/100 |
| Medium | 27/38 | 39/62 | 66/100 |
| Low | 9/38 | 5/62 | 14/100 |
| Column Total | 38 | 62 | 100 |

Table 5.10

| Income Level | Own a smartphone: No | Own a smartphone: Yes | Row Total |
|---|---|---|---|
| High | 5.26% | 29.03% | 20.00% |
| Medium | 71.05% | 62.90% | 66.00% |
| Low | 23.68% | 8.06% | 14.00% |
| Column Total | 38 | 62 | 100 |

Here, the column relative frequencies are different for some columns. Hence, we can say that the two categorical variables, i.e., income level and smartphone ownership are
