Chapter 2 One-variable descriptive stats - tabs and graphs PDF

amU (Aix-Marseille University)

2025

Alain Paraponaris

Tags

descriptive statistics data analysis tabular displays graphical displays

Summary

This document is a lecture on one-variable descriptive statistics. It presents tabular and graphical displays for summarizing data, with examples and explanations, and covers both categorical and quantitative data. It was given in the 2024-2025 academic year as part of an economics and management program.

Full Transcript

Chapter 2: One-Variable Descriptive Statistics: Tabular and Graphical Displays
Alain Paraponaris, ✉ [email protected]
Licence degree in Economics & Management
International Program in Economics and Management (IPEM)
1st year, 1st semester (L1S1)
Academic year 2024-2025

1. Introduction

Data can be classified as either categorical or quantitative
▶ Categorical data use labels or names to identify categories of like items
▶ Quantitative data are numerical values that indicate how much or how many

This chapter introduces the use of tabular and graphical displays for summarizing both categorical and quantitative data
▶ Tabular and graphical displays can be found in annual reports, newspaper articles, and research studies
▶ It is important to understand how they are constructed and how they should be interpreted

In this chapter, we will deal with:
▶ the use of tabular and graphical displays to summarize the data for a single variable
▶ the use of tabular and graphical displays to summarize the data for two variables in a way that reveals the relationship between the two variables
▶ how to represent data efficiently with the help of software

2. Summarizing data for a categorical variable

2.1 Frequency and relative frequency

2.1.1 Frequency distributions
A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping categories or classes.
▶ Suppose a supermarket records the brands of soft drinks sold during a random hour of the opening day, giving the table below
▶ The table is also available in a spreadsheet named Soft drinks

In this form, the table is difficult to use
▶ It is certainly more convenient to reorganize the table by counting the number of times each brand has been sold

The values of the variable Soft Drink are presented in alphabetical order, but could be arranged in ascending or descending order of sales frequency
▶ Coca-Cola would then rank first, Pepsi second, etc., and Dr. Pepper and Sprite would be tied for fourth
Let us denote by $n_i$ the number of sales associated with soft drink brand $i$
▶ Let us arbitrarily (respecting the alphabetical order) associate the value 1 with the label Coca-Cola, the value 2 with Diet Coke, etc., and the value 5 with Sprite
▶ Then, $n_1 = 19$, $n_2 = 8$, etc., and $n_5 = 5$
▶ Each of $n_1$, $n_2$, etc., and $n_5$ represents a frequency
▶ We have: $n_1 + n_2 + n_3 + n_4 + n_5 = 19 + 8 + 5 + 13 + 5 = 50$
▶ For convenience, we can rewrite the above as: $\sum_{i=1}^{5} n_i = n_1 + n_2 + n_3 + n_4 + n_5 = 19 + 8 + 5 + 13 + 5 = 50$
▶ If the 50 sales represent the exhaustive number of soft drink sales in the supermarket, we can also write: $\sum_{i=1}^{5} n_i = N$
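The counting step described above can be sketched in Python. This is only an illustration: the frequencies are those quoted in the text, but the raw list of 50 sales is rebuilt from them (an assumption, since the original table is not reproduced here), and the names `counts_from_text`, `sales` and `frequency` are hypothetical.

```python
from collections import Counter

# Frequencies n_1..n_5 quoted in the text, in alphabetical order of brand.
counts_from_text = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

# Assumption: the raw list of sales is rebuilt from the counts,
# so the order of the individual sales is arbitrary.
sales = [brand for brand, n in counts_from_text.items() for _ in range(n)]

# Frequency distribution: number of observations in each category.
frequency = Counter(sales)

# Total frequency N = n_1 + n_2 + ... + n_5.
N = sum(frequency.values())

for brand in sorted(frequency):
    print(f"{brand:<12}{frequency[brand]:>3}")
print(f"{'Total':<12}{N:>3}")
```

Sorting by `frequency.most_common()` instead of alphabetically would give the ranking by sales mentioned earlier (Coca-Cola first, Pepsi second).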
2.1.2 Relative frequency distributions

A frequency distribution shows the number (frequency) of observations in each of several nonoverlapping classes

But we may be interested in the proportion, or percentage, of observations in each class
▶ The relative frequency of a class equals the fraction or proportion of observations belonging to that class
▶ For a data set with $N$ observations, the relative frequency of the $i$-th class, denoted $f_i$, is the frequency $n_i$ of this class divided by the total frequency $N$:
$$f_i = \frac{n_i}{\sum_{i=1}^{p} n_i} = \frac{n_i}{N}$$
where $p$ represents the number of distinct values (classes) taken by the variable ($p = 5$ in the table of soft drinks)
▶ By definition, $f_i \in [0; 1]$, so for convenience it is also possible (and easy) to compute the percent frequency of each class by multiplying the relative frequency by 100
▶ Relative frequency and percent frequency express the same number: $0.38 = 38/100 = 38\%$

Coca-Cola accounts for 38% of all the soft drink sales, Pepsi for 26%, etc., and Dr. Pepper and Sprite for 10% each

Note that:
$$\sum_{i=1}^{p} f_i = \sum_{i=1}^{p} \frac{n_i}{\sum_{i=1}^{p} n_i} = \sum_{i=1}^{p} \frac{n_i}{N} = \frac{n_1}{N} + \frac{n_2}{N} + \dots + \frac{n_p}{N}$$
$$= \frac{1}{N}\left(n_1 + n_2 + \dots + n_p\right) = \frac{1}{N}\sum_{i=1}^{p} n_i = \frac{1}{N} \times N = 1.00 = 100\%$$
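As a quick numerical check of the definitions above, the following sketch computes $f_i = n_i / N$ and the percent frequencies for the soft drink frequencies quoted in the text, and verifies that the relative frequencies sum to 1. The variable names are hypothetical.

```python
# Frequencies n_i quoted in the text (alphabetical order of brand).
counts = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

N = sum(counts.values())  # total frequency

# Relative frequency f_i = n_i / N, and percent frequency = 100 * f_i.
relative = {brand: n / N for brand, n in counts.items()}
percent = {brand: 100 * f for brand, f in relative.items()}

# The relative frequencies must add up to 1 (i.e. 100%).
total_relative = sum(relative.values())

for brand in counts:
    print(f"{brand:<12}{counts[brand]:>3}{relative[brand]:>7.2f}{percent[brand]:>6.0f}%")
print(f"{'Total':<12}{N:>3}{total_relative:>7.2f}")
```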
2.2 Bar charts and pie charts

A bar chart is a graphical display for depicting categorical data summarized in a frequency, relative frequency, or percent frequency distribution
▶ On the (usually) horizontal axis, labels used for the classes (categories) are specified
▶ A frequency, relative frequency, or percent frequency scale can be used for the (usually) vertical axis
▶ Then, using a bar of fixed width drawn above each class label, we extend the length of the bar until we reach the frequency, relative frequency, or percent frequency of the class
▶ For categorical data, the bars should be separated because each category is separate

The pie chart provides another graphical display for presenting relative frequency and percent frequency distributions for categorical data
▶ Relative frequencies are used to subdivide a circle into sectors, or parts, that correspond to the relative frequency for each class
▶ The area of the sector associated with any value taken by the variable must represent exactly the same share of the total area of the disc (equal to $\pi r^2$, with $r$ the radius of the disc) as the share of the frequency of that value in the total frequency
▶ For instance, the sector representing the share of Coca-Cola sales in the total soft drink sales spans: $0.38 \times 360 = 136.8$ degrees
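The two constructions above can be sketched without any plotting library: a text bar chart whose bar length equals the frequency, and the sector angle of each pie slice computed as the relative frequency times 360 degrees. The frequencies come from the text; the names `bar_lines` and `angles` are hypothetical.

```python
# Frequencies n_i quoted in the text (alphabetical order of brand).
counts = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}
N = sum(counts.values())

# Text bar chart: one separate bar per category, length = frequency.
bar_lines = [f"{brand:<11}| {'#' * n} {n}" for brand, n in counts.items()]
print("\n".join(bar_lines))

# Pie chart: sector angle (in degrees) = relative frequency * 360.
angles = {brand: 360 * n / N for brand, n in counts.items()}
for brand, angle in angles.items():
    print(f"{brand:<11}{angle:>7.1f} degrees")
```

Because the sector area is proportional to its angle, using $f_i \times 360$ degrees guarantees that each slice covers the share $f_i$ of the disc.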
3. Summarizing data for a quantitative variable

3.1 Frequency and relative frequency

3.1.1 Frequency distributions

As for categorical variables, a frequency distribution is a tabular summary of data showing the number (frequency) of observations in each nonoverlapping category or class

While this is self-evident for categorical variables (whose categories are obviously separate), quantitative data require more care in defining the nonoverlapping classes
▶ Consider the fictitious example of an accounting company that studies the year-end audit times for a sample of 20 clients (table on the left)
▶ The simplest way to represent the frequency distribution of audit times is the dot plot on the right
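A dot plot can be sketched in text form as follows. The 20 audit times used here are assumed values, chosen to be consistent with the figures quoted in this chapter (smallest value 12, largest value 33); they stand in for the original table, which is not reproduced here. The function name `dot_plot_lines` is hypothetical.

```python
from collections import Counter

# Assumed audit times (in days) for 20 clients; illustrative values only.
audit_days = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
              22, 23, 22, 21, 33, 28, 14, 18, 16, 13]


def dot_plot_lines(values):
    """Return one text line per integer value between min and max,
    with one '*' per observation taking that value."""
    counts = Counter(values)
    return [f"{v:>3} | {'*' * counts[v]}"
            for v in range(min(values), max(values) + 1)]


lines = dot_plot_lines(audit_days)
print("\n".join(lines))
```

Values that no client takes (for example 24 days in this assumed sample) appear as empty rows.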
The previous representation may be inconvenient, especially because some values of the variable are not taken by the audit time data of the 20 clients

In this situation, it may be preferable to create classes aggregating several successive values of the variable at stake

Aggregation can be based on accepted social, economic, financial or other criteria
▶ Annual average grades obtained by students in France range from 0 to 20 (inclusive)
▶ They can be grouped into the following categories: [0;10[, [10;12[, [12;14[, [14;16[ and [16;20]

When there is no usual aggregation to rely on, the statistician can proceed as follows
▶ Number of classes
  ▶ Classes are formed by specifying ranges that will be used to group the data: between 5 and 20 classes are recommended, depending on the number of elements in the population or sample
  ▶ For a small number of data items, as few as five or six classes may be used to summarize the data; for a larger number of data items, a larger number of classes is usually required
  ▶ The goal is to use enough classes to show the variation in the data, but not so many classes that some contain only a few data items
▶ Width of classes
  ▶ As a general guideline, it is recommended that the width be the same for each class
  ▶ As a result, the choices of the number of classes and of the width of classes are not independent: a larger (respectively smaller) number of classes means a smaller (respectively larger) class width
  ▶ An approximate class width is given by:
$$\text{Width} = \frac{\text{range of values}}{\text{number of classes}} = \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$$
  ▶ In our example, as the sample size is small ($n = 20$), it is advisable to create 5 classes
  ▶ Therefore, the approximate width of each class is equal to:
$$\text{Width} = \frac{33 - 12}{5} = \frac{21}{5} = 4.2$$
  which, after rounding up, means that 5 classes with a width equal to 5 days will be considered

▶ Here is a possible aggregation into 5 classes from the original distribution, which results in a new table (left) and a histogram (right)
▶ The convenience of the histogram is that it makes immediately apparent that audits most often require 15 to 19 days
▶ When the width of each class is the same, the histogram makes it possible to identify the mode (the most frequent value or class of values taken by the variable) of the distribution
▶ This is not true when classes have different widths
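The class-width rule and the grouping into 5 classes can be sketched as follows. Several things here are assumptions rather than facts from the original slides: the 20 audit times are illustrative values consistent with the quoted range (12 to 33 days), the approximate width is rounded up with `math.ceil`, and the first class is assumed to start at 10 days so that one class is the "15 to 19 days" class mentioned above.

```python
import math

# Assumed audit times (in days) for 20 clients; illustrative values only.
audit_days = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
              22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

n_classes = 5
smallest, largest = min(audit_days), max(audit_days)

# Approximate width = (largest value - smallest value) / number of classes.
approx_width = (largest - smallest) / n_classes

# Assumption: round the approximate width up to a whole number of days.
class_width = math.ceil(approx_width)

# Assumption: the first class starts at 10 days.
first_lower = 10
classes = [(first_lower + k * class_width,
            first_lower + (k + 1) * class_width - 1)
           for k in range(n_classes)]

# Frequency of each class: number of audit times falling inside it.
class_freq = [sum(lo <= x <= hi for x in audit_days) for lo, hi in classes]

# With equal widths, the modal class is the one with the highest frequency.
modal_class = classes[class_freq.index(max(class_freq))]

for (lo, hi), n in zip(classes, class_freq):
    print(f"{lo}-{hi} days: {n}")
print("Modal class:", modal_class)
```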
Comparing the histogram drawn for the audit times (quantitative data) with the bar chart obtained for the soft drink purchases (categorical data), one sees that the adjacent rectangles now touch one another

Unlike a bar graph, a histogram contains no natural separation between the rectangles of adjacent classes, and this format is the usual convention for histograms
▶ There is no space between rectangles because there is no discontinuity between classes: all values between the lower bound of the first class and the upper bound of the last class are possible
▶ This point will be of major importance when the representation of cumulative distributions is discussed

One of the most important uses of a histogram is to provide information about the shape of a distribution

Panel A (respectively panel B) shows the histogram for a set of data moderately skewed to the left (resp. to the right): its tail extends farther to the left (resp. to the right)
▶ Variables with a left-skewed distribution: exam scores (in graduate studies), age at death (from natural causes)
▶ Variables with a right-skewed distribution: exam scores (in undergraduate studies), wages, income, housing prices, number of goals per match during the 2022 World Cup
Panel C shows a symmetric histogram: the left tail mirrors the shape of the right tail
▶ Histograms are never perfectly symmetric, but heights, weights, and the daily milk production of cows give rise to more or less symmetric distributions

Panel D shows a histogram highly skewed to the right
▶ Examples: housing prices, salaries, purchase amounts

Although convenient for presenting the data in a table or a figure, the aggregation into classes is not neutral with regard to the calculation of statistics
▶ Calculating the average audit time from the original table gives a mean time of $(12 + 15 + \dots + 13)/20 = 19.25$ days
▶ With data aggregated into 5 classes, the calculation first requires the definition of class midpoints:
$$\text{Class midpoint} = \frac{\text{lower bound of the class} + \text{upper bound of the class}}{2}$$
▶ This makes the results less precise: instead of the true value of the variable weighted by the number of elements presenting this value, the calculation assigns an approximate value (the class midpoint) to all the elements belonging to the class, whether or not they actually present a value equal to the midpoint
▶ In the case of audit times, it would give a mean time of 19.5 days (instead of 19.25) with data aggregated into 5 classes of width equal to 5 days (details on this kind of calculation will be given in a forthcoming section of the chapter)
▶ The smaller the number of classes and the wider the classes, the greater the bias
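The effect of aggregation on the mean can be illustrated with a short sketch. The 20 audit times are assumed values (the original table is not reproduced here), chosen so that their mean equals the 19.25 days quoted above; the classes are assumed to be [10;15[, [15;20[, [20;25[, [25;30[ and [30;35[, which gives midpoints 12.5, 17.5, and so on.

```python
# Assumed audit times (in days) for 20 clients; illustrative values only.
audit_days = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
              22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Mean computed from the raw (non-aggregated) data.
raw_mean = sum(audit_days) / len(audit_days)

# Assumed classes [lower; upper[ of width 5 days.
bounds = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]

# Class frequencies and class midpoints = (lower bound + upper bound) / 2.
class_freq = [sum(lo <= x < hi for x in audit_days) for lo, hi in bounds]
midpoints = [(lo + hi) / 2 for lo, hi in bounds]

# Mean from aggregated data: every element of a class is given the
# class midpoint as its (approximate) value.
grouped_mean = sum(m * n for m, n in zip(midpoints, class_freq)) / sum(class_freq)

print(f"Mean from raw data:     {raw_mean:.2f} days")
print(f"Mean from grouped data: {grouped_mean:.2f} days")
```

Under these assumptions the two means differ by a quarter of a day, which is the loss of precision described in the text.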
The previous bias can be partly avoided with the use of a stem-and-leaf representation

Suppose that students of the Faculty have taken an aptitude test, and that the number of correct answers of each student is given in the table below

To develop a stem-and-leaf display, the leading digits of each data value are placed to the left of a vertical line and the last digit of each data value is placed to the right

On the left, the raw stem-and-leaf display and, on the right, the stem-and-leaf display after reordering the last digits associated with each leading digit

Rotating each figure counterclockwise onto its side provides a picture of the data that is similar to a histogram with classes of 60–69, 70–79, 80–89, and so on

Although the stem-and-leaf display may appear to offer the same information as a histogram, it has two primary advantages:
▶ The stem-and-leaf display is easier to construct by hand
▶ Within a class interval, the stem-and-leaf display provides more information than the histogram because the stem-and-leaf shows the actual data
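The construction of an ordered stem-and-leaf display can be sketched as follows. The scores are hypothetical (the original table of test results is not reproduced here), and the function name `stem_and_leaf` is an illustrative choice: the stem is taken as all digits but the last, and the leaf as the last digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leading digits) to the sorted list of its leaves
    (last digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


# Hypothetical numbers of correct answers for 10 students.
scores = [72, 68, 85, 91, 75, 85, 69, 106, 98, 81]

display = stem_and_leaf(scores)
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(d) for d in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Each row keeps the actual data values, which is the extra information a histogram with classes 60–69, 70–79, and so on would lose.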
3.1.2 Relative frequency distributions

Everything that has been presented regarding frequency distributions continues to hold for the representation of relative frequency distributions

The table giving the frequency distribution for the audit time data can easily be converted into relative frequency and percent frequency distributions
▶ The definition of the relative frequency $f_i$ associated with a given $i$-th value of the variable at stake remains the same: $f_i = n_i / N$
▶ Converting the relative frequency into a percent frequency just requires multiplying $f_i$ by 100

3.2 Cumulative distributions

3.2.1 Definition
Unlike for categorical variables, it makes sense to present frequency distributions for quantitative variables as cumulative frequencies
▶ For instance, it makes sense to assert that 12 students engaged in an undergraduate program got a grade at least equal to 14 (out of a maximum of 20)
▶ It is also meaningful to state that 102 employees of a company earn less than €6,000 per month (net, before tax)

Consider a quantitative variable that takes $p$ different values
▶ As it is a quantitative variable, its values can be ordered, for instance, from the smallest one (the 1st value) to the largest one (the $p$-th value)
▶ Note that the values can equally well be arranged in descending order (from the $p$-th value to the 1st)
▶ Suppose that the statistician is interested in the number of elements in the population presenting a value of the variable of interest at most equal to its $k$-th value
▶ The cumulative frequency $N_k$ associated with the $k$-th value of this quantitative variable is defined by:
$$N_k = \sum_{i=1}^{k} n_i$$
▶ Some properties of the cumulative frequency are quite straightforward: $N_0 = 0$, $N_p = N$
▶ It is also easy to deduce from the definition of the cumulative frequency the definition of the cumulative relative frequency associated with the $k$-th value of a quantitative variable:
$$F_k = \frac{N_k}{N} = \frac{\sum_{i=1}^{k} n_i}{N} = \sum_{i=1}^{k} \frac{n_i}{N} = \frac{n_1}{N} + \frac{n_2}{N} + \dots + \frac{n_k}{N} = f_1 + f_2 + \dots + f_k = \sum_{i=1}^{k} f_i$$
▶ As before, it follows directly that: $F_0 = 0\%$, $F_p = 100\%$

It still makes sense to present relative frequency distributions for quantitative variables as cumulative relative frequencies
▶ 25% of students engaged in an undergraduate program got a grade at least equal to 14 (/20)
▶ 96% of the workforce earn less than €6,000 per month (net, before tax)

The two tables below illustrate how to transform the original table of audit time data presented in classes (on the left) into cumulative frequencies, cumulative relative frequencies and cumulative percent frequencies (on the right)

3.2.2 Representing cumulative distributions

Representing cumulative frequencies, cumulative relative frequencies and cumulative percent frequencies first requires identifying the kind of quantitative variable at stake
▶ Depending on whether the quantitative variable is ordinal (e.g. the number of dependent children in a household) or measured on an interval scale (e.g. the daily delay, in minutes, of high-speed trains reaching their destination), the kind of representation will not be the same
▶ Quantitative ordinal variables are discontinuous (discrete): no value exists between two successive values of the variable (e.g. the number of dependent children is not defined between 2 and 3)
▶ The cumulative step chart illustrates discontinuity (as did, previously, the representation of frequencies and relative frequencies by imposing a space between rectangles)
▶ Quantitative variables measured on an interval scale are continuous: it is always possible, theoretically at least, to find a value between two successive values
▶ The cumulative diagram illustrates continuity (as did, previously, the representation of frequencies and relative frequencies by imposing no space between rectangles)
▶ The cumulative diagram is very useful for identifying graphically the quantiles of a distribution (e.g. deciles, quartiles, and the median)

In the two examples (number of dependent children in the household and daily delay of high-speed trains), the cumulative step chart and the cumulative diagram have been drawn from distributions of cumulative relative frequencies, but they could also have been drawn from cumulative frequencies

If the cumulative diagram of the cumulative (relative) frequencies in ascending order of the values of the variable and the cumulative diagram of the cumulative (relative) frequencies in descending order of the values of the variable are drawn in the same orthonormal coordinate system, the intersection point of the two diagrams corresponds to the median of the distribution
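The definitions of $N_k$ and $F_k$ can be sketched with `itertools.accumulate`. The class frequencies used here are assumed values for the audit time example (the original tables are not reproduced here), and all variable names are hypothetical. The sketch also builds the descending ("at least") cumulative frequencies and locates the class in which the ascending cumulative relative frequency first reaches 50%, which is where the median lies.

```python
from itertools import accumulate

# Assumed class frequencies n_i for the audit times (illustrative values).
classes = ["10-14", "15-19", "20-24", "25-29", "30-34"]
freq = [4, 8, 5, 2, 1]
N = sum(freq)

# Ascending cumulative frequency: N_k = n_1 + ... + n_k.
cum_freq = list(accumulate(freq))

# Cumulative relative frequency: F_k = N_k / N.
cum_rel = [Nk / N for Nk in cum_freq]

# Descending cumulative frequency: observations in class k or above.
cum_freq_desc = list(accumulate(reversed(freq)))[::-1]

# The median lies in the first class where F_k reaches at least 50%.
median_index = next(i for i, Fk in enumerate(cum_rel) if Fk >= 0.5)
median_class = classes[median_index]

for label, n, Nk, Fk, Dk in zip(classes, freq, cum_freq, cum_rel, cum_freq_desc):
    print(f"{label:<6}{n:>3}{Nk:>4}{Fk:>7.2f}{Dk:>4}")
print("Median class:", median_class)
```

The last ascending value equals $N$ (so $F_p = 100\%$), and the first descending value equals $N$ as well, in line with the properties stated above.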
4. Glossary

Bar chart: A graphical device for depicting categorical data that have been summarized in a frequency, relative frequency, or percent frequency distribution.
Categorical data: Labels or names used to identify categories of like items.
Cumulative frequency distribution: A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each class.
Cumulative percent frequency distribution: A tabular summary of quantitative data showing the percentage of data values that are less than or equal to the upper class limit of each class.
Cumulative relative frequency distribution: A tabular summary of quantitative data showing the fraction or proportion of data values that are less than or equal to the upper class limit of each class.
Data visualization: A term used to describe the use of graphical displays to summarize and present information about a data set.
Dot plot: A graphical device that summarizes data by the number of dots above each data value on the horizontal axis.
Frequency distribution: A tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping categories or classes.
Histogram: A graphical display of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the class intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
Percent frequency distribution: A tabular summary of data showing the percentage of observations in each of several nonoverlapping classes.
Pie chart: A graphical device for presenting data summaries based on subdivision of a circle into sectors that correspond to the relative frequency for each class.
Quantitative data: Numerical values that indicate how much or how many.
Relative frequency distribution: A tabular summary of data showing the fraction or proportion of observations in each of several nonoverlapping categories or classes.
Stem-and-leaf display: A graphical display used to show simultaneously the rank order and shape of a distribution of data.
