Topic 2b Descriptive Stats Grouped Data (Student Notes) PDF

Summary

This document is student notes for a statistics subject (likely business statistics) on descriptive statistics and grouped data. It discusses organizing data using frequency distributions, contingency tables, and graphical representations. The document also provides an example of computing a grouped mean.

Full Transcript


2b | Data Organisation and Presentation (Descriptive Stats – Grouped)
BLO1001 Statistics for Business

Describing Data with Tables and Graphical Presentations

Intended Learning Outcomes:
❖ Organise data into a frequency distribution, contingency tables and any other appropriate graphical representation methods.
❖ Organise a given data set using software application functions into appropriate graphical data presentations.
❖ Summarise the key findings from a set of organised data.

Singapore hosts mega concerts and performances every year. Performers from Asia and the West, across different genres of music and catering to different age groups, have performed here. Imagine that you are a music festival organiser tasked with curating the perfect line-up for 2025. To understand the profile of your audience, you would probably want to look at some important variables such as:
❖ Music genre
❖ Age group
❖ Preferred language
❖ Gender
❖ Household income
❖ Budget set aside for concerts

After you have surveyed your target audience, you could end up with a massive dataset about them. But this raw data will be chaotic, confusing, messy and impossible to plan from. Raw data has little meaning until it is organised into a readable form from which meaningful conclusions can be drawn. So first, we need to organise the data. We use tools like contingency tables or frequency distribution tables to help us.

Contingency Table - a summary table that presents data showing the relationship between 2 or more qualitative (categorical) variables.

Example
With the massive data set collected from the surveys, you can group the data into a contingency table. For example, you may want to find out how many respondents like a specific music genre in a particular language. Both of these are qualitative variables.
The data collected for these 2 qualitative variables can be organised in a contingency table, giving you an overview of the relationship between respondents’ favourite music genre and their preferred language for songs. The number in each "box" or cell shows the frequency count of music-lovers who enjoy a particular combination of music genre and language.

Frequency Distribution Table - groups data into categories/classes, showing the number of quantitative observations in each category/class.

Example
Suppose you want to find out how much your target audience would be willing to pay for a concert ticket. This is a quantitative variable. The lowest amount in the survey was $50 and the highest amount was $590. So, we group the values into ranges referred to as ‘classes’ or ‘categories’. The data collected is then organised into a frequency distribution table, which summarises the number of respondents whose budget for performances falls within each class.

As seen above, the frequency distribution table provides us with a simplified view of the data. You can see that most of the survey respondents (observations) would be willing to pay between $100 and $299 for a concert ticket. But is this analysis sufficient? Probably not, as price is usually not the only determining factor. As a music festival organiser, you will still need to analyse the other variables that you have collected. But the frequency distribution table gives an initial understanding of the data. With further analysis, we will be able to understand the:
❖ range or spread of the data,
❖ concentration of the data, and
❖ shape of the distribution [1]

Constructing a Frequency Distribution Table

How do we create a frequency distribution table and decide how many classes or categories we need?

Step 1 Data Collection
Collect the data that you need. Refer to the Introduction to the COPAI process in Week 1.
Step 2 Decide the Number of Classes
Classes are the categories or "bins" into which you will sort your data. Generally, you should keep between 5 and 20 classes, because having too many or too few classes provides little insight into your data.

Step 3 Determine the Class Interval Width
To determine the class interval width, subtract the lowest value in your dataset from the highest value, and divide the result by the number of classes that you decided on in Step 2.

\[ \text{Class Interval Width} = \frac{\text{Highest value} - \text{Lowest value}}{\text{Number of classes}} \]

Remember, it is good practice to use intervals that are easy to understand and visualise, e.g. multiples of 5, 10 etc.

Step 4 Determine Class Limits
Starting with the lowest value in your dataset, add the class interval width from Step 3 to get the upper limit of the first class. For the next class, the lower limit is the upper limit of the previous class. Continue until all classes are created.

Step 5 Tally your Data and Find the Frequencies
At this stage, go through your data and count the frequency (occurrences) for each class. Be sure to check that the total tally matches the total number of observations.

[1] Recall skewness, which you learnt in Week 2 when we were describing data using central tendency and dispersion.

Example
You have surveyed 100 classmates on the number of texts they sent in a day. The lowest is 10 texts and the highest is 88 texts. To create a frequency distribution table:

Step 1 involves collecting data, and this is done.
Step 2 Select the Number of Classes: Let us choose 8 classes.
Step 3 Decide the Class Interval Width: The range is 78 (88 - 10), so the width is 78 ÷ 8 = 9.75. But a class interval width of 9.75 would make the classes hard to read because of the decimal points. Hence, we round the class interval width up to 10, which is easier to understand and visualise.
Step 4 Class Limits: Start at 10 as the first lower limit and create ranges up to 90.
So, each class covers an interval of 10. (Pause: Why do we not start at 0? Check in with our Tutors if need be.)
Step 5 Tally the data and find the frequencies: Count the frequencies for each class.

Your frequency distribution table may end up looking a little like the one below:

Texts per Day | Frequency
10 up to 20 |
20 up to 30 |
30 up to 40 |
And so on… |

The frequency distribution table gives us the absolute number of observations (i.e. the “count”) in each category or class.

Relative Frequency Distribution and Percentage

Very often, we want to know how often an ‘observation’ in the data set occurs in relation to the total number of observations in the data set. This is referred to as the relative frequency. It gives us "context" by expressing each class as a proportion of the whole dataset.

Let us use the music festival survey from earlier. For example, we see that an ‘absolute’ number of 429 respondents are willing to pay between $100 and $200 for a concert ticket. But this absolute number does not tell us what proportion of all respondents are willing to pay between $100 and $200.

To get the relative frequency, we take the frequency of each class in the frequency distribution table and divide it by the total number of observations (i.e. survey respondents).

\[ \text{Relative Frequency} = \frac{f}{\sum f} \]

Relative Frequency Percentage is the relative frequency multiplied by 100%.

Budget for Concerts ($) | Frequency (f) | Relative Frequency | Relative Frequency Percentage
0 up to 100 | 151 | 151 ÷ 1505 = 0.10 | 10.0%
100 up to 200 | 429 | 429 ÷ 1505 = 0.29 | 28.5%
200 up to 300 | 431 | 431 ÷ 1505 = 0.29 | 28.6%
300 up to 400 | 184 | 184 ÷ 1505 = 0.12 | 12.2%
400 up to 500 | 198 | 198 ÷ 1505 = 0.13 | 13.2%
500 up to 600 | 112 | 112 ÷ 1505 = 0.07 | 7.4%
Total | 1505 | 1.00 | 100%

So, from the relative frequency table, we can see that 28.5% of respondents surveyed are willing to spend between $100 and $200 on concerts.
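The relative frequency column above can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the original notes: the function name `relative_frequencies` and the variable names are illustrative assumptions, and only the class labels and frequencies come from the table.

```python
# Frequencies copied from the "Budget for Concerts" table in these notes.
classes = ["0 up to 100", "100 up to 200", "200 up to 300",
           "300 up to 400", "400 up to 500", "500 up to 600"]
freqs = [151, 429, 431, 184, 198, 112]


def relative_frequencies(freqs):
    """Divide each class frequency f by the total frequency (sum of f)."""
    total = sum(freqs)
    return [f / total for f in freqs]


rel = relative_frequencies(freqs)

# Print one row per class: label, frequency, relative frequency, percentage.
for label, f, r in zip(classes, freqs, rel):
    print(f"{label:<15} {f:>5} {r:>6.2f} {r * 100:>6.1f}%")
print(f"{'Total':<15} {sum(freqs):>5} {sum(rel):>6.2f} {sum(rel) * 100:>6.1f}%")
```

Checking that the relative frequencies add up to 1 (and the frequencies to 1505) is a quick way to catch a tallying mistake.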
Relative frequency is especially useful when comparing datasets of different sizes or when you want to understand the distribution of the data collected in a more standardised manner. This can help reveal patterns and insights that raw numbers alone cannot.

Example: Relative Frequency Distribution – Comparing 2 Data Sets

The school has created an entrepreneurship space for students to set up their business ideas on campus. You are planning to set up a booth selling Japanese specialty snacks and drinks. You surveyed classmates from 2 different classes to better understand your target customers’ profile (i.e. mostly fellow students). Data about the average amount spent per meal was collected and organised in a frequency distribution table for the 2 classes. What can you observe?

Amount spent per meal ($) | Class A Frequency | Class A Relative Frequency | Class A Relative Frequency % | Class B Frequency | Class B Relative Frequency | Class B Relative Frequency %
5 but less than 10 | 2 | 0.080 | 8.0% | 8 | 0.308 | 30.8%
10 but less than 15 | 6 | 0.240 | 24.0% | 4 | 0.154 | 15.4%
15 but less than 20 | 7 | 0.280 | 28.0% | 5 | 0.192 | 19.2%
20 but less than 25 | 3 | 0.120 | 12.0% | 2 | 0.077 | 7.7%
25 but less than 30 | 1 | 0.040 | 4.0% | 3 | 0.115 | 11.5%
30 but less than 35 | 2 | 0.080 | 8.0% | 1 | 0.038 | 3.8%
35 but less than 40 | 3 | 0.120 | 12.0% | 3 | 0.115 | 11.5%
40 but less than 45 | 1 | 0.040 | 4.0% | 0 | 0.000 | 0.0%
Total | 25 | 1 | 100% | 26 | 1 | 100%

From this frequency distribution table, we can observe that:
❖ The range of the amount spent per meal is wider for Class A: $5 to less than $45, compared with $5 to less than $40 for Class B. This suggests that there is greater dispersion in the amount spent in Class A.
❖ The modal class for Class A is $15 to less than $20, whereas the modal class for Class B is $5 to less than $10.
❖ For both classes, about 70% of students spend between $5 and less than $25 per meal (72.0% in Class A and 73.1% in Class B).
❖ The shape of the distribution for both classes is skewed to the right.
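The comparison above can be checked with a short Python sketch. It is an illustration only: the helper names `to_percentages` and `modal_class_index` are assumptions introduced here, and the frequencies are copied from the Class A and Class B table.

```python
# Lower class limits ($) and frequencies copied from the table in these notes.
lower_limits = [5, 10, 15, 20, 25, 30, 35, 40]
class_a = [2, 6, 7, 3, 1, 2, 3, 1]   # 25 students
class_b = [8, 4, 5, 2, 3, 1, 3, 0]   # 26 students


def to_percentages(freqs):
    """Relative frequency percentage of each class: 100 * f / (sum of f)."""
    total = sum(freqs)
    return [100 * f / total for f in freqs]


def modal_class_index(freqs):
    """Position of the class with the highest frequency."""
    return max(range(len(freqs)), key=lambda i: freqs[i])


pct_a = to_percentages(class_a)
pct_b = to_percentages(class_b)

# Share of students spending $5 to less than $25 (the first four classes).
share_a = sum(pct_a[:4])
share_b = sum(pct_b[:4])

print(f"Class A: modal class starts at ${lower_limits[modal_class_index(class_a)]}, "
      f"{share_a:.1f}% spend $5 to less than $25")
print(f"Class B: modal class starts at ${lower_limits[modal_class_index(class_b)]}, "
      f"{share_b:.1f}% spend $5 to less than $25")
```

Because the two classes have different sizes (25 and 26 students), the percentages are what make the two columns directly comparable, not the raw counts.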
Frequency Distribution and Computing Grouped Mean

We learnt about calculating the mean in Topic 2a using ungrouped data. How do we compute the mean if we do not have the raw data points? We use the classes of data and the frequency of each class to compute the grouped mean.

Using the music festival example, we have grouped our data into 6 classes, with the number of observations (frequency) in each class. In this case, the formula to compute the grouped mean is:

\[ \text{Grouped Mean} = \frac{\sum fx}{n} \]

where f is the frequency of a class, x is the class midpoint, ∑fx is the total weighted sum, and n is the total frequency (total number of observations).

A grouped mean of $262.29 means that, on average, the respondents in this survey are willing to spend about $262 for a concert ticket. It gives you a good idea of the "typical" budget for most of the people surveyed. The grouped mean takes into account the distribution of data across the classes and weights the contribution of each class by how many data points it contains.

However, the grouped mean is only an estimate. We assume that the data points are uniformly distributed within each class, which may not be true. The grouped mean does not consider the exact data points, as it uses class midpoints to get the average. This is unlike the mean of ungrouped data, which we learnt in the earlier topic.

Graphical Presentation of Numerical Variables

Numbers, proportions and percentages are not the only ways to look at statistics. Graphical presentations such as shapes, charts and graphs are also used extensively in statistics to communicate data in an accessible and visually appealing way. They allow us to quickly identify relationships between variables and outliers in the data set. Visuals can also make it easier to present statistical findings to audiences who may not have a background in statistics, making the data not only informative but also engaging.
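Returning briefly to the grouped mean: the worked table behind the $262.29 figure is not reproduced in these notes, so the following Python sketch reconstructs the calculation under stated assumptions. It assumes the 6 classes and frequencies are those of the concert budget table ($0 up to $600 in steps of $100) and that each class midpoint x is the average of its lower and upper limits. The function name `grouped_mean` is an illustrative choice.

```python
# Class limits ($) and frequencies from the concert budget table in these notes.
class_limits = [(0, 100), (100, 200), (200, 300),
                (300, 400), (400, 500), (500, 600)]
freqs = [151, 429, 431, 184, 198, 112]


def grouped_mean(class_limits, freqs):
    """Grouped mean = (sum of f * x) / n, where x is the class midpoint."""
    # Assumption: the midpoint is halfway between the lower and upper limit.
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    weighted_sum = sum(f * x for f, x in zip(freqs, midpoints))
    n = sum(freqs)
    return weighted_sum / n


mean = grouped_mean(class_limits, freqs)
print(f"Grouped mean: ${mean:.2f}")
```

Under these assumptions the weighted sum is 394,750 and n is 1505, which gives the $262.29 quoted in the notes. The result is an estimate because every respondent in a class is treated as if their budget were exactly the class midpoint.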
Histogram
❖ Good for quantitative continuous data (e.g., height, weight, exam scores).
❖ Bars in a histogram are placed next to each other, with no gaps.
❖ Helps to give a sense of the distribution and shape of the data.

Bar Graph
❖ Good for comparing categorical data.
❖ Bars do not touch, representing distinct groups or categories (e.g., music genres, types of movies).

Source: Math Monks (n.d.)

Example
Using the frequency distribution table from the music festival example earlier, we can construct a graphical representation using a histogram. We can also draw a ‘rough’ distribution curve by connecting the midpoints of the top edges of all the bars. This immediately gives us a visual cue on the shape of the distribution.

Based on this histogram, we can make the following observations:
❖ Range: The range of the data is from $0 up to $600.
❖ Clustering of the data: The majority of the observations fall into the 2 classes with the highest frequencies, audiences with budgets of $100 up to $200 and budgets of $200 up to $300.
❖ Shape of distribution [2]: The histogram is skewed to the right, or positively skewed. Fewer people are willing to pay for expensive tickets.

[2] Recall skewness, which you learnt in Week 2 when we were describing data using central tendency and dispersion.

Other Types of Graphs used to present Numerical Variables

Apart from the histogram, graphs such as dot plots, stem-and-leaf displays and cumulative frequency graphs are also used to describe data with numerical variables.

Dot plots
These are similar to histograms, but instead of bars, individual dots are used to show the distribution of data across the values of the variable. They can be helpful for small datasets because they show the individual data points.

(Source: ChartExpo (2024))

The dot plot above shows the number of customers entering a store at specific times of the day, so each dot represents one observation.
The dot plot tells us that the modal time of visit to the store is 1700 hours (or 5 pm), so the shop is busiest at this time.

Stem-and-leaf display
The stem-and-leaf display splits each data point into a “stem” and a “leaf”. The “stem” is the first digit, whilst the “leaf” is the last digit. It is a simple way of showing how the data is distributed whilst retaining the value of each data point.

Example
The Residents’ Committee (RC) of Ponggol Graphite estate collected data on the ages of its 30 volunteers. The RC Chairman uses a stem-and-leaf display to understand the distribution of the volunteers’ ages. Below are the results:

Stem | Leaf
1 | 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9
2 | 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
3 | 0, 1, 2, 3, 5

In the above display:
❖ The "1" stem represents ages in the teens (10-19), and each leaf represents an individual within that age range.
❖ The "2" stem represents ages in the twenties (20-29), with leaves for each age in that range.
❖ The "3" stem represents ages in the thirties (30-39), with leaves for each age in that range.

This display tells us the frequency of each age and the distribution across different age groups.

Cumulative Frequency Graphs
The cumulative frequency graph shows the accumulation of frequencies at or below each class, which can be helpful for understanding the distribution and relative frequencies.

The cumulative frequency graph on the left shows the cumulative frequency curve of the weights of 100 packets of cookies. The curve tells us that about 26 packets of cookies weigh 80 grams or less, and that no packets of cookies weigh more than 200 grams.

Summary
In this chapter, we have used statistical tools such as frequency distribution tables and histograms to group the data and describe the range, clustering and shape of the distribution.
Frequency distribution tables organise our data into classes, making it easier to see how often specific values occur, which is essential for identifying the most common or rare occurrences within a dataset. Graphs such as histograms help to provide a visual representation of the data's distribution. Together, these descriptive statistics tools provide a clear, concise way to organise and present data, and to communicate findings to both technical and non-technical audiences.

Optional: For more real-life examples of how statistics are presented graphically, please visit SingStats, Department of Statistics Singapore. The sections below are particularly interesting for our syllabus.

Source: SingStats (2024)

During the tutorial for this Topic, our Tutors will conduct a demonstration to show how we can use MS Excel to generate histograms.

End of Topic 2b
