Applied Business Statistics PDF

# Chapter 2 - Summarising Data: Summary Tables and Graphs

## 2.1 Introduction

Managers can only benefit from statistical findings if the information can easily be interpreted and effectively communicated to them. Summary tables and graphs are commonly used to convey descriptive statistical results. A table or a graph can convey information much more quickly and vividly than a written report. For graphs in particular, there is much truth in the adage 'a picture is worth a thousand words'. In practice, an analyst should always consider using summary tables and graphical displays ahead of written text to convey statistical information to managers.

Summary tables and graphs can be used to summarise (or profile) a single random variable (e.g. most-preferred TV channel by viewers or pattern of delivery times) or to examine the relationship between two random variables (e.g. between gender and newspaper readership). The choice of a summary table and graphic technique depends on the data type being analysed (i.e. categorical or numeric).

The sample dataset in Table 2.1 of the shopping habits of 30 grocery shoppers will be used to illustrate the different summary tables and graphs.

## 2.2 Summarising Categorical Data

### Single Categorical Variable

#### Categorical Frequency Table

A categorical frequency table summarises data for a single categorical variable. It shows how many times each category appears in a sample of data and measures the relative importance of the different categories.

Follow these steps to construct a categorical frequency table:

- List all the categories of the variable (in the first column).
- Count and record (in the second column) the number of occurrences of each category.
- Convert the counts per category into percentages of the total sample size (in the third column). This produces a percentage categorical frequency table.

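The three construction steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `percentage_frequency_table` and the short list of responses are hypothetical and are not taken from Table 2.1.

```python
from collections import Counter


def percentage_frequency_table(values):
    """Return (category, count, percentage) rows for one categorical variable."""
    counts = Counter(values)          # step 2: count occurrences per category
    n = len(values)                   # total sample size
    # steps 1 and 3: list the categories and convert counts to percentages
    return [(cat, cnt, round(100 * cnt / n, 1))
            for cat, cnt in sorted(counts.items())]


# Hypothetical responses for illustration (not the survey data)
responses = ["A", "B", "A", "C", "B", "A", "A", "B"]
table = percentage_frequency_table(responses)

for cat, cnt, pct in table:
    print(f"{cat:<10}{cnt:>5}{pct:>8.1f}%")
```

The counts always sum to the sample size, which is a quick check that no response was dropped or counted twice.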
It is always a good idea to express the counts as percentages because this makes them easy to understand and interpret. In addition, it makes the comparisons between samples of different sizes easier to explain.

A categorical frequency table can be displayed graphically either as a bar chart or a pie chart.

#### Bar Chart

To construct a bar chart, draw a horizontal axis (x-axis) to represent the categories and a vertical axis (y-axis) scaled to show either the frequency counts or the percentages of each category. Then construct vertical bars for each category to the height of its frequency count (or percentage) on the y-axis.

Note that the sum of the frequency counts (or percentages) across the bars must equal the sample size (or 100%). The bars must be of equal width to avoid distorting a category's importance. Provided the widths are equal, neither the order of the categories on the x-axis nor the chosen bar width matters. It is only the bar heights that convey the information of category importance.

#### Pie Chart

To construct a pie chart, divide a circle into category segments. The size of each segment must be proportional to the count (or percentage) of its category. The sum of the segment counts (or percentages) must equal the sample size (or 100%).

### Example 2.1 Grocery Shoppers Survey

A market research company conducted a survey amongst grocery shoppers to identify their demographic profile and shopping patterns. A random sample of 30 grocery shoppers was asked to complete a questionnaire that identified:

- at which grocery store they most preferred to shop
- the number of visits to the grocery store in the last month
- the amount spent last month on grocery purchases
- their age, gender and family size.

The response data to each question is recorded in Table 2.1. Each column shows the 30 responses to each question and each row shows the responses of a single grocery shopper to all six questions.

Refer to the 'store preference' variable in Table 2.1.
- Construct a percentage frequency table to summarise the store preferences of the sample of 30 grocery shoppers.
- Show the findings graphically as a bar chart and as a pie chart.

#### Management Questions

1. Which grocery store is most preferred by shoppers?
2. What percentage of shoppers prefer this store?
3. What percentage of shoppers prefer to shop at Spar grocery stores?

#### Solution

1. For the categorical variable 'store preference' there are three categories of grocery stores that shoppers use: 1 = Checkers; 2 = Pick n Pay; 3 = Spar.
2. To construct the percentage frequency table, first count the number of shoppers that prefer each store - there are 10 ones (Checkers), 17 twos (Pick n Pay) and 3 threes (Spar). Then convert the counts into percentages by dividing the count per store by 30 (the sample size) and multiplying the result by 100 (i.e. Checkers = 10/30 × 100 = 33.3%; Pick n Pay = 17/30 × 100 = 56.7%; Spar = 3/30 × 100 = 10%).
3. The percentage frequency table of grocery store preferences is shown in Table 2.2.

| Preferred store | Count | Percentage |
|---|---|---|
| 1 = Checkers | 10 | 33.3% |
| 2 = Pick n Pay | 17 | 56.7% |
| 3 = Spar | 3 | 10.0% |
| Total | 30 | 100% |

#### Management Interpretation

- The grocery store most preferred by shoppers is Pick n Pay.
- More than half of the sampled shoppers (56.7%) prefer to shop at Pick n Pay for their groceries.
- Only 10% of the sampled shoppers prefer to do their grocery shopping at Spar.

Charts and graphs must always be clearly and adequately labelled with headings, axis titles and legends to make them easy to read and to avoid any misrepresentation of information. The data source must, where possible, also be identified to allow a user to assess the credibility and validity of the summarised findings.

Bar charts and pie charts display the same information graphically.
In a bar chart, the importance of a category is shown by the height of a bar, while in a pie chart this importance is shown by the size of each segment (or slice). The differences between the categories are clearer in a bar chart, while a pie chart conveys more of a sense of the whole. A limitation of both the bar chart and the pie chart is that each displays the summarised information on only one variable at a time.

### Two Categorical Variables

#### Cross-tabulation Table

A cross-tabulation table (also called a contingency table) summarises the joint responses of two categorical variables. The table shows the number (and/or percentage) of observations that jointly belong to each combination of categories of the two categorical variables.

This summary table is used to examine the association between two categorical measures.

Follow these steps to construct a cross-tabulation table:

- Prepare a table with m rows (m = the number of categories of the first variable) and n columns (n = the number of categories of the second variable), resulting in a table with (m × n) cells.
- Assign each pair of data values from the two variables to an appropriate category-combination cell in the table by placing a tick in the relevant cell.
- When each pair of data values has been assigned to a cell in the table, count the number of ticks per cell to derive the joint frequency count for each cell.
- Sum each row to give row totals per category of the row variable.
- Sum each column to give column totals per category of the column variable.
- Sum the column totals (or row totals) to give the grand total (sample size).
- These joint frequency counts can be converted to percentages for easier interpretation. The percentages could be expressed in terms of the total sample size (percent of total), or of row subtotals (percent of rows) or of column subtotals (percent of columns).

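The tallying steps above translate directly into code. The following is a minimal sketch under stated assumptions: the helper names `cross_tabulate` and `row_percentages` are hypothetical, and the eight (row category, column category) pairs are made-up illustration data, not the survey responses.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Return joint counts, row totals, column totals and the grand total."""
    joint = Counter(pairs)                       # one 'tick' per (row, column) pair
    row_totals = Counter(r for r, _ in pairs)    # sum of each row
    col_totals = Counter(c for _, c in pairs)    # sum of each column
    return joint, row_totals, col_totals, len(pairs)


def row_percentages(joint, row_totals):
    """Express each joint count as a percentage of its row total."""
    return {(r, c): round(100 * cnt / row_totals[r], 1)
            for (r, c), cnt in joint.items()}


# Hypothetical paired observations: (store, gender)
pairs = [("X", "f"), ("X", "f"), ("X", "m"), ("Y", "f"),
         ("Y", "m"), ("Y", "m"), ("Y", "m"), ("X", "f")]

joint, row_totals, col_totals, grand_total = cross_tabulate(pairs)
row_pct = row_percentages(joint, row_totals)
print(joint[("X", "f")], row_totals["X"], col_totals["f"], grand_total)
print(row_pct[("X", "f")])
```

Column percentages or percentages of the grand total follow the same pattern, dividing by `col_totals[c]` or `grand_total` instead of the row total.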
To determine whether an association exists between two categorical variables, compare the overall percentage profile of one of the categorical variables to the percentage profile of this same variable for each level of the second categorical variable. If the overall percentage profile is the same as (or very similar to) each level's percentage profile, then there is no association. If at least one level's percentage profile differs significantly from the overall percentage profile, then an association exists.

The cross-tabulation table can be displayed graphically either as a stacked bar chart (also called a component bar chart) or a multiple bar chart. An inspection of the charts can also reveal evidence of an association or not. The more similar the chart profiles are (based on either row percentage-wise or column percentage-wise computations), the less likely there is an association and vice versa.

#### Stacked Bar Chart

Follow these steps to construct a stacked bar chart:

- Choose, say, the row variable, and plot the frequency of each category of this variable as a simple bar chart.
- Split the height of each bar in proportion to the frequency count of the categories of the column variable.
- This produces a simple bar chart of the row variable with each bar split proportionately into the categories of the column variable. The categories of the column variable are 'stacked' on top of each other within each category bar of the row variable.

#### Multiple Bar Chart

Follow these steps to construct a multiple bar chart:

- For each category of, say, the row variable, plot a simple bar chart constructed from the corresponding frequencies of the categories of the column variable.
- Display these categorised simple bar charts next to each other on the same axes.

The multiple bar chart is similar to a stacked bar chart, except that the stacked bars are displayed next to rather than on top of each other.
The two charts convey exactly the same information on the association between the two variables. They differ only in how they emphasise the relative importance of the categories of the two variables.

### Example 2.2 Grocery Shoppers Survey - Store Preferences by Gender

Refer to the 'store preference' variable and the 'gender' variable in Table 2.1.

1. Construct a cross-tabulation table of frequency counts between 'store preference' (as the row variable) and 'gender' (as the column variable) of shoppers surveyed.
2. Display the cross-tabulation as a stacked bar chart and as a multiple bar chart.
3. Construct a percentage cross-tabulation table to show the percentage split of gender for each grocery store.

#### Management Questions

1. How many shoppers are male and prefer to shop at Checkers?
2. What percentage of all grocery shoppers are females who prefer Pick n Pay?
3. What percentage of all Checkers' shoppers are female?
4. Of all male shoppers, what percentage prefer to shop at Spar for their groceries?
5. Is there an association between gender and store preference (i.e. does store preference differ significantly between male and female shoppers)?

#### Solution

1. The row categorical variable is 'store preference': 1 = Checkers; 2 = Pick n Pay; 3 = Spar. The column categorical variable is 'gender': 1 = female; 2 = male.

#### Table 2.3 Cross-tabulation table of frequency counts (store preferences by gender)

| Store | Female | Male | Total |
|---|---|---|---|
| 1 = Checkers | 7 | 3 | 10 |
| 2 = Pick n Pay | 10 | 7 | 17 |
| 3 = Spar | 2 | 1 | 3 |
| Total | 19 | 11 | 30 |

2. To produce the cross-tabulation table, count how many females prefer to shop at each store (Checkers, Pick n Pay and Spar) and then count how many males prefer to shop at each store (Checkers, Pick n Pay and Spar). These joint frequency counts are shown in Table 2.3. The cross-tabulation table can also be completed using percentages (row percentages, column percentages or as percentages of the total sample).
3. Figure 2.3 and Figure 2.4 show the stacked bar chart and multiple bar chart respectively for the cross-tabulation table of joint frequency counts in Table 2.3.
4. Table 2.4 shows, for each store separately, the percentage split by gender (row percentages), while Table 2.5 shows, for each gender separately, the percentage breakdown by the grocery store preferred (column percentages).

#### Table 2.4 Row percentage cross-tabulation table (store preferences by gender)

| Store | Female | Male | Total |
|---|---|---|---|
| 1 = Checkers | 70% | 30% | 100% |
| 2 = Pick n Pay | 59% | 41% | 100% |
| 3 = Spar | 67% | 33% | 100% |
| Total | 63% | 37% | 100% |

From Table 2.4, of those shoppers who prefer Checkers, 70% are female and 30% are male. Similarly, of those who prefer Pick n Pay, 59% are female and 41% are male. Finally, 67% of customers who prefer to shop at Spar are female, while 33% are male. Overall, 63% of grocery shoppers are female, while only 37% are male.

#### Table 2.5 Column percentage cross-tabulation table (store preferences by gender)

| Store | Female | Male | Total |
|---|---|---|---|
| 1 = Checkers | 37% | 27% | 33% |
| 2 = Pick n Pay | 53% | 64% | 57% |
| 3 = Spar | 11% | 9% | 10% |
| Total | 100% | 100% | 100% |

From Table 2.5, of all female shoppers, 37% prefer Checkers, 53% prefer Pick n Pay and 11% prefer to shop for groceries at Spar. For males, 27% prefer Checkers, 64% prefer Pick n Pay and the balance (9%) prefer to shop at Spar for their groceries. Overall, 33% of all shoppers prefer Checkers, 57% prefer Pick n Pay and only 10% prefer Spar for grocery shopping.

#### Management Interpretation

1. Of the 30 shoppers surveyed, there are only three males who prefer to shop at Checkers.
2. 33.3% (10 out of 30) of all shoppers surveyed are females who prefer to shop at Pick n Pay.
3. 70% (7 out of 10) of all Checkers shoppers are female. (Refer to the row percentages in Table 2.4.)
4. Only 9% (1 out of 11) of all males prefer to shop at Spar. (Refer to the column percentages in Table 2.5.)
5. Since the percentage breakdown between male and female shoppers across the three grocery stores is reasonably similar to the overall gender profile regardless of store preference (i.e. 63% female and 37% male), it can be concluded that gender and store preference are not statistically associated.

## 2.3 Summarising Numeric Data

Numeric data can also be summarised in table format and displayed graphically. The table is known as a numeric frequency distribution and the graph of this table is called a histogram.

From Table 2.1, the numeric variable 'age of shoppers' will be used to illustrate the construction of a numeric frequency distribution and its histogram.

### Single Numeric Variable

#### Numeric Frequency Distribution

A numeric frequency distribution summarises numeric data into intervals of equal width. Each interval shows how many numbers (data values) fall within the interval.

Follow these steps to construct a numeric frequency distribution:

- Determine the data range. Range = Maximum data value - Minimum data value
- Choose the number of intervals (k). While there is no strict formula to find k, each one of the following rules can be used as a guide based on sample size (n): Sturges' rule (k = 1 + 3.322 × log10(n)); Rice's rule (k = 2 × n^(1/3)) and the Square-root rule (k = √n). X-Static uses Rice's rule. As a general rule, choose between 5 and 10 intervals, depending on the sample size: the smaller the sample size, the fewer the number of intervals, and vice versa. For n = 30 shoppers, choose five intervals.
- Determine the interval width. Interval width = Data range/Number of intervals. Use this as a guide to determine a 'neat' interval width. For the 'age' variable, the approximate interval width is 46/5 = 9.2 years. Hence choose an interval width of 10 years.
- Set up the interval limits.
The lower limit for the first interval should be a value smaller than or equal to the minimum data value and should be a number that is easy to use. Since the youngest shopper is 23 years old, choose the lower limit of the first interval to be 20. The lower limits for successive intervals are found by adding the interval width to each preceding lower limit. The upper limits are chosen to avoid overlaps between adjacent interval limits.

| Lower limit | Upper limit |
|---|---|
| 20 | < 30 (or 29) |
| 30 | < 40 (or 39) |
| 40 | < 50 (or 49) |
| 50 | < 60 (or 59) |
| 60 | < 70 (or 69) |

The format of < 30 (less than 30) should be used if the source data is continuous, while an upper limit such as 29 can be used if the data values are discrete.

- Tabulate the data values. Assign each data value to one, and only one, interval. A count of the data values assigned to each interval produces the summary table, called the numeric frequency distribution.

When constructing a numeric frequency distribution, ensure that:

- the interval widths are equal in size
- the interval limits do not overlap (i.e. intervals must be mutually exclusive)
- each data value is assigned to only one interval
- the intervals are fully inclusive (i.e. cover the data range)
- the frequency counts sum to the sample size, n, or the percentage frequencies sum to 100%.

The frequency counts can be converted to proportions by dividing each frequency count by the sample size, and to percentages by multiplying each proportion by 100. The resultant summary table is called a percentage (or relative) frequency distribution. It shows the percentage (or proportion) of data values within each interval.

#### Histogram

A histogram is a graphic display of a numeric frequency distribution.

Follow these steps to construct a histogram:

- Arrange the intervals consecutively on the x-axis from the lowest interval to the highest. There must be no gaps between adjacent interval limits.
- Plot the height of each bar (on the y-axis) over its corresponding interval, to show either the frequency count or percentage frequency of each interval.

The area of a bar (width × height) measures the density of values in each interval.

### Example 2.3 Grocery Shoppers Survey - Profiling the Ages of Shoppers

Refer to the 'age of shoppers' variable in Table 2.1.

1. Construct a numeric frequency distribution for the age profile of grocery shoppers.
2. Compute the percentage frequency distribution of shoppers' ages.
3. Construct a histogram of the numeric frequency distribution of shoppers' ages.

#### Management Questions

1. How many shoppers are between 20 and 29 years of age?
2. What is the most frequent age interval of shoppers surveyed?
3. What percentage of shoppers belong to the most frequent age interval?
4. What percentage of shoppers surveyed are 60 years or older?
5. What is the maximum age for the youngest 20% of shoppers surveyed?

#### Solution

1. and 2. The numeric and percentage frequency distributions for the ages of grocery shoppers are shown in Table 2.6, and are based on the steps shown above.

| Age (years) | Count | Percentage |
|---|---|---|
| 20-29 | 6 | 20% |
| 30-39 | 9 | 30% |
| 40-49 | 8 | 27% |
| 50-59 | 4 | 13% |
| 60-69 | 3 | 10% |
| Total | 30 | 100% |

3. Figure 2.5 shows the histogram of the numeric frequency distribution for shoppers' ages.

#### Management Interpretation

1. There are six shoppers between the ages of 20 and 29 years.
2. The most frequent age interval is between 30 and 39 years.
3. 30% of shoppers surveyed are between 30 and 39 years of age.
4. 10% of shoppers surveyed are 60 years or older.
5. The youngest 20% of shoppers are no older than 29 years.

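The interval-selection steps used for the age variable can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, Sturges' rule is written in its usual form k = 1 + log2(n) and Rice's rule as k = 2 × n^(1/3) (both assumed here as the standard statements of those rules), and the counts are those shown in Table 2.6.

```python
import math


def interval_count_rules(n):
    """Suggested number of intervals k under three common rules of thumb."""
    return {
        "sturges": 1 + math.log2(n),      # same as 1 + 3.322 * log10(n)
        "rice": 2 * n ** (1 / 3),
        "square_root": math.sqrt(n),
    }


def interval_lower_limits(first_lower, width, k):
    """Lower limits found by repeatedly adding the interval width."""
    return [first_lower + i * width for i in range(k)]


def percentage_distribution(counts):
    """Convert frequency counts to whole-number percentages."""
    n = sum(counts)
    return [round(100 * c / n) for c in counts]


rules = interval_count_rules(30)                 # n = 30 shoppers
limits = interval_lower_limits(20, 10, 5)        # first limit 20, width 10 years
age_counts = [6, 9, 8, 4, 3]                     # counts from Table 2.6
percentages = percentage_distribution(age_counts)

print({name: round(k, 1) for name, k in rules.items()})
print(limits, percentages)
```

All three rules suggest roughly five to six intervals for n = 30, which is consistent with the choice of five intervals in the example.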
If the numeric data are discrete values in a limited range (5-point rating scales, number of children in a family, number of customers in a bank queue, for example), then the individual discrete values of the random variable can be used as the 'intervals' in the construction of a numeric frequency distribution and a histogram. This is illustrated in Example 2.4 below.

### Example 2.4 Grocery Shoppers Survey - Profiling the Family Size of Shoppers

Refer to the random variable 'family size' in the database in Table 2.1.

Construct a numeric and percentage frequency distribution and histogram of the family size of grocery shoppers surveyed.

#### Management Questions

1. Which is the most common family size?
2. How many shoppers have a family size of three?
3. What percentage of shoppers have a family size of either three or four?

#### Management Interpretation

1. The most common family size of grocery shoppers is two.
2. There are eight shoppers that have a family size of three.
3. 43.4% (26.7% + 16.7%) of shoppers surveyed have a family size of either three or four.

#### Cumulative Frequency Distribution

Data for a single numeric variable can also be summarised into a cumulative frequency distribution.

A cumulative frequency distribution is a summary table of cumulative frequency counts which is used to answer questions of a 'more than' or 'less than' nature.

### Example 2.5 Grocery Shoppers Survey - Analysis of Grocery Spend

Refer to the numeric variable 'spend' (amount spent on groceries last month) in Table 2.1.

1. Compute the numeric frequency distribution and percentage frequency distribution for the amount spent on groceries last month by grocery shoppers.
2. Compute the cumulative frequency distribution and its graph, the ogive, for the amount spent on groceries last month.

#### Management Questions

1. What percentage of shoppers spent less than R1 200 last month?
2. What percentage of shoppers spent R1 600 or more last month?
3. What percentage of shoppers spent between R800 and R1 600 last month?
4. What was the maximum amount spent last month by the 20% of shoppers who spent the least on groceries? Approximate your answer.
5. What is the approximate minimum amount spent on groceries last month by the top-spending 50% of shoppers?

#### Solution

1. The numeric frequency distribution for amount spent is computed using the construction steps outlined earlier. The range is R2 136 - R456 = R1 680. Choosing five intervals, the interval width can be set to a 'neat' width of R400 (based on R1 680/5 = R336). The lower limit of the first interval is set at a 'neat' limit of R400, since the minimum amount spent is R456. The numeric and percentage frequency distributions are both shown in Table 2.8.

| Grocery spend (R) | Count | Percentage |
|---|---|---|
| 400-<800 | 7 | 23.3% |
| 800-<1200 | 14 | 46.7% |
| 1200-<1600 | 5 | 16.7% |
| 1600-<2000 | 3 | 10.0% |
| 2000-<2400 | 1 | 3.3% |
| Total | 30 | 100% |

2. The cumulative frequency distribution (ogive) for amount spent on groceries last month is computed using the construction guidelines outlined above for the ogive. Based on the numeric frequency distribution in Table 2.8, with R400 being the minimum grocery spend, the following cumulative counts are derived:

- 7 shoppers (23.3%) spent up to R800
- 21 (= 7 + 14) shoppers (70%) spent up to R1 200
- 26 (= 21 + 5) shoppers (86.7%) spent up to R1 600
- 29 (= 26 + 3) shoppers (96.7%) spent up to R2 000
- all 30 shoppers (= 29 + 1) spent no more than R2 400 on groceries last month.

#### Management Interpretation

1. 70% of shoppers spent less than R1 200 on groceries last month.
2. 13.3% (100% - 86.7%) of shoppers spent R1 600 or more on groceries last month.
3. 63.4% (86.7% - 23.3% or 46.7% + 16.7%) of shoppers spent between R800 and R1 600 on groceries last month.
4. The bottom 20% of shoppers spent no more than R770 (approximately) on groceries last month.
5. From the y-axis value at 50%, the minimum amount spent on groceries by the top-spending 50% of shoppers is (approximately) R1 000.

Note: The ogive is a 'less than' cumulative frequency graph, but it can also be used to answer questions of a 'more than' nature (by subtracting the 'less than' cumulative percentage from 100%, or the cumulative count from n, the sample size).

### Two Numeric Variables

The relationship between two numeric random variables can be examined graphically by plotting their values on a set of axes.

The graphs that are useful to display the relationship between two numeric random variables are: a scatter plot, a trendline graph and a Lorenz curve. Each graph addresses a different type of management question.

#### Scatter Plot

A scatter plot displays the data points of two numeric variables on an x-y graph.

A visual inspection of a scatter plot will show the nature of a relationship between the two variables in terms of its strength (the closeness of the points), its shape (linear or curved), its direction (direct or inverse) and any outliers (extreme data values).

For example, a plot of advertising expenditure (on the x-axis) against sales (on the y-axis) could show what relationship, if any, exists between advertising expenditure and sales. Another example is to examine what influence training hours (on the x-axis) could have on worker output (on the y-axis).

Follow these steps to construct a scatter plot:

- Label the horizontal axis (x-axis) with the name of the influencing variable (called the independent variable, x).
- Label the vertical axis (y-axis) with the name of the variable being influenced (called the dependent variable, y).
- Plot each pair of data values (x; y) from the two numeric variables as coordinates on an x-y graph.

### Example 2.6 Grocery Shoppers Survey - Amount Spent by Number of Store Visits

Refer to the dataset in Table 2.1.
Construct a scatter plot for the amount spent on groceries and the number of visits to the grocery store per shopper by the sample of 30 shoppers surveyed.

#### Management Questions

By inspection of the scatter plot, describe the nature of the relationship between the number of visits and amount spent.

#### Solution

To construct the scatter plot, we need to define the x and y variables. Since the number of visits is assumed to influence the amount spent on groceries in a month, let x = number of visits and y = amount spent.

On a set of axes, plot each pair of data values for each shopper. For example, for shopper 1, plot x = 3 visits against y = R946; for shopper 2, plot x = 5 visits against y = R1 842. The results of the scatter plot are shown in Figure 2.8.

#### Management Interpretation

There is a moderate, positive linear relationship between the number of visits to a grocery store in a month and the total amount spent on groceries last month per shopper. The more frequent the visits, the larger the grocery bill for the month. There is only one possible outlier - shopper 13, who spent R2 136 over four visits.

Note: Both the strength and the direction of the relationship that is observed in a scatter plot can be measured by a correlation coefficient (-1 ≤ r ≤ +1) using the Excel function CORREL(array1, array2). This statistic is covered in detail in Chapter 12 (section 12.3).

#### Trendline Graph

A trendline graph plots the values of a numeric random variable over time. Such data are called time series data. The x-variable is time and the y-variable is a numeric measure of interest to a manager (such as turnover, unit cost of production, absenteeism or share prices).

Follow these steps to construct a trendline graph:

- The horizontal axis (x-axis) represents the consecutive time periods.
- The values of the numeric random variable are plotted on the vertical (y-axis) opposite their time period.
- The consecutive points are joined to form a trendline.

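The trendline steps above amount to pairing each value with its time period and then looking at how consecutive points connect. The following is a rough text-only sketch, assuming hypothetical helper names `trendline_points` and `period_changes`; the values are the first eight weeks of absenteeism from Table 2.10 in Example 2.7, and a plotting library would normally be used in place of the simple star bars printed here.

```python
def trendline_points(values, first_period=1):
    """Pair each value with its consecutive time period: (period, value)."""
    return [(first_period + i, v) for i, v in enumerate(values)]


def period_changes(points):
    """Change in value along each segment joining consecutive points."""
    return [(t2, y2 - y1) for (_, y1), (t2, y2) in zip(points, points[1:])]


# Weeks 1 to 8 of employee-days absent (from Table 2.10)
absent = [54, 58, 94, 70, 61, 61, 78, 56]
points = trendline_points(absent)
changes = period_changes(points)

# Crude text rendering: one star per five employee-days absent
for week, value in points:
    print(f"week {week:>2} | {'*' * (value // 5)} {value}")
```

The sign and size of each entry in `changes` show whether the line rises or falls between consecutive weeks, which is what the eye picks up from the joined points.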
Trendline graphs are commonly used to identify and track trends in time series data.

### Example 2.7 Factory Absenteeism Levels Study

Refer to the time series data in Table 2.10 on weekly absenteeism levels at a car manufacturing plant.

| Week | Absent |
|---|---|
| 1 | 54 |
| 2 | 58 |
| 3 | 94 |
| 4 | 70 |
| 5 | 61 |
| 6 | 61 |
| 7 | 78 |
| 8 | 56 |
| 9 | 49 |
| 10 | 55 |
| 11 | 95 |
| 12 | 85 |
| 13 | 60 |
| 14 | 64 |
| 15 | 99 |
| 16 | 80 |
| 17 | 62 |
| 18 | 78 |
| 19 | 88 |
| 20 | 73 |
| 21 | 65 |
| 22 | 84 |
| 23 | 92 |
| 24 | 70 |
| 25 | 59 |
| 26 | 65 |
| 27 | 105 |
| 28 | 84 |
| 29 | 80 |
| 30 | 90 |
| 31 | 112 |
| 32 | 94 |

Produce a trendline plot of the weekly absenteeism levels (number of employee-days absent) for this car manufacturing plant over a period of 32 weeks.

#### Management Question

By an inspection of the trendline graph, describe the trend in weekly absenteeism levels within this car manufacturing plant over the past 32 weeks.

#### Solution

To plot the trendline, plot the weeks (x = 1, 2, 3, ..., 32) on the x-axis. For each week, plot the corresponding employee-days absent on the y-axis. After plotting all 32 y-values, join the points to produce the trendline graph as shown in Figure 2.9.

#### Management Interpretation

Over the past 32 weeks there has been a modest increase in absenteeism, with an upturn occurring in more recent weeks. A distinct 'monthly' pattern exists, with absenteeism in each month generally low in weeks one and two, peaking in week three and declining moderately in week four.

#### Lorenz Curve

A Lorenz curve plots the cumulative frequency distributions (ogives) of two numeric random variables against each other. Its purpose is to show the degree of inequality between the values of the two variables.
For example, the Lorenz curve can be used to show the relationship between:

- the value of inventories against the volume of inventories held by an organisation
- the spread of the total salary bill amongst the number of employees in a company
- the concentration of total assets amongst the number of companies in an industry
- the spread of the taxation burden amongst the total number of taxpayers.

A Lorenz curve shows what percentage of one numeric measure (such as inventory value, total salaries, total assets or total taxation) is accounted for by given percentages of the other numeric measure (such as volume of inventory, number of employees, number of companies or number of taxpayers). The degree of concentration or distortion can be clearly illustrated by a Lorenz curve. It is commonly used as a measure of social/economic inequality. It was originally developed by M Lorenz (1905) to represent the distribution of income amongst households.

Follow these steps to construct a Lorenz curve:

- Identify intervals (similar to a histogram) for the y-variable, for which the distribution across a population is being examined (e.g. salaries across employees).
- Calculate the total value of the y-variable per interval (total value of salaries paid to all employees earning less than R1 000 per month; total value of salaries paid to all employees earning between R1 001 and R2 000 per month; etc.).
- Calculate the total number of objects (e.g. employees, households or taxpayers) that fall within each interval of the y-variable (number of employees earning less than R1 000 per month; number of employees earning between R1 001 and R2 000 per month; etc.).
- Derive the cumulative frequency percentages for each of the two distributions above.
- Scale each axis (x and y) from 0% to 100%.
- For each interval of the y-variable, plot each pair of cumulative frequency percentages on the axes and join the coordinates (similar to a scatter plot).

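The construction steps above reduce to computing two cumulative percentage series and pairing them. The sketch below is one possible illustration, not a definitive implementation: the function names are hypothetical, and the employee counts and salary totals per interval are invented figures chosen only to show the mechanics.

```python
def cumulative_percentages(values):
    """Running total of values expressed as a percentage of the overall total."""
    total = sum(values)
    running, result = 0, []
    for v in values:
        running += v
        result.append(round(100 * running / total, 1))
    return result


def lorenz_points(object_counts, interval_totals):
    """Lorenz curve coordinates: (cumulative % of objects, cumulative % of value)."""
    xs = cumulative_percentages(object_counts)
    ys = cumulative_percentages(interval_totals)
    return [(0.0, 0.0)] + list(zip(xs, ys))   # the curve starts at (0%; 0%)


# Hypothetical data: employees per salary interval and total salaries per interval
employees = [50, 30, 15, 5]
salary_totals = [40000, 45000, 45000, 70000]

points = lorenz_points(employees, salary_totals)
for x, y in points:
    print(f"{x:>6.1f}% of employees receive {y:>6.1f}% of total salaries")
```

In this made-up case the lowest-paid 50% of employees account for only 20% of the salary bill, so the plotted points would bow well away from the 45° line of equal distribution.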
If the distributions are similar or equal, the Lorenz curve will result in a 45° line from the origin of both axes (called the line of uniformity or the line of equal distribution). The more unequal the two distributions, the more bent (concave or convex) the curve becomes.

A Lorenz curve always starts at coordinate (0%; 0%) and ends at coordinate (100%; 100%).

### Example 2.8 Savings Balances versus Number of Savers Study

A bank wished to analyse the value of savings account balances against the number of savings accounts of a sample of 64 bank clients. The two numeric frequency distributions and their respective percentage ogives (for the value of savings balances and number of savings accounts) are given in Table 2.11.

| Savings balances (R) | Number of savers | Total savings (R) | Percentage of savers | Percentage of total savings |
|---|---|---|---|---|
| Below 0 | 0 | 0 | 0 | |
| 0-<500 | 12 | 4 089 | 19 | |
| 500-<1000 | 18 | 14 022 | 47 | |
| 1000-<3000 | 25 | 35 750 | 86 | |
| 3000-<5000 | 6 | 24 600 | 95 | |
