Zayed University Reasoning with Data Lecture Notes PDF
Document Details
Zayed University
2024
Summary
These lecture notes from Zayed University cover various methods of data visualization, including frequency tables, bar graphs, pie charts, and infographics, for both qualitative and quantitative data. The notes introduce key concepts in data analysis and provide examples of how to use data visualization tools like Excel.
Full Transcript
Reasoning with Data
FYE-110 Lecture Notes
Department of Mathematics and Statistics
College of Natural and Health Sciences
Zayed University
Fall 2024

Contents
1 Presenting Data: Visualizations
1.1 Organize Qualitative Data
1.1.1 Frequency Tables
1.1.2 Bar Graphs
1.1.3 Pie Charts
1.1.4 Infographics
1.2 Organize Quantitative Data
1.2.1 Frequency distributions
1.2.2 Histogram
1.2.3 Line plot
1.2.4 Radar plot
1.3 Graphical Displays in Media
1.4 Graphs, Good and Bad

Unit 1: Presenting Data: Visualizations

Introduction

In the PPDAC cycle (Problem, Plan, Data, Analysis, Conclusion), the analysis phase often begins with understanding the data. A crucial part of this process is "eyeballing" the data, which involves exploring it visually through graphs, tables, and charts. This initial visual exploration helps you get a feel for the data and guides deeper analysis.

Being able to read, interpret, and reason about data visualizations like tables, graphs, and infographics is an essential skill. In today's media and online platforms, data is represented using various engaging and interactive techniques. These go beyond traditional graphs such as pie charts and bar graphs, which most learners encounter in their first year of university.

In this unit, you will learn how to use software tools like Excel and other supporting packages to create accurate and informative tables, charts, and diagrams. The goal is to select the best visual representation for your data.
To choose the most appropriate graph, it's essential to identify the type of data you're working with. In Unit 1, we learned about two main types of variables: qualitative and quantitative. Qualitative variables can be further divided into nominal and ordinal types, while quantitative variables can be continuous or discrete. This chapter will focus on how to visually represent these different types of variables, ensuring that the visualizations convey the data's meaning accurately and effectively.

1.1 Organize Qualitative Data

Qualitative data, which captures categories or qualities, is best visualized using graphs that emphasize comparisons between groups or categories. Common methods include bar charts and pie charts, which give a clear visual representation of frequencies and proportions, making it easier to identify patterns and draw insights from categorical data. Effective visualization helps in quickly communicating the key features of the data without overwhelming the audience.

1.1.1 Frequency Tables

When working with large datasets, it's essential to organize the information in a way that reveals patterns and trends. One effective method is to group the data based on the number of occurrences of each category or class. This leads to what we call a frequency distribution, which helps to summarize the data for easier analysis. Here are some definitions you need to be familiar with:

Definition 1.1: Frequency Distribution
- A frequency distribution is a tabular summary of data showing the number of observations (the absolute frequency) in each class.
- The relative frequency for each class is the frequency of the class divided by the total number of observations:

$$\text{Relative Frequency} = \frac{\text{Absolute Frequency}}{\text{Total Number of Observations}}$$

- It is useful to express the relative frequency as a percentage: multiply the relative frequency by 100 and put a % sign at the end.
There are a few important points to keep in mind when working with a frequency distribution table:

1. Excel is a very useful tool for constructing a frequency distribution, especially when working with large sample sizes.
2. The last (total) row is optional, but it serves as a good check for your calculations. The frequencies should sum to the total number of observations, the relative frequencies should sum to 1.00, and the percentages should add up to 100%, subject to minor rounding errors.
3. There is no strict rule for ordering the classes. You can arrange them alphabetically or from highest to lowest, depending on what makes the most sense for your analysis.

A frequency distribution for a categorical data set is illustrated in the next example.

Example 1
A recent survey revealed that there were 200 traffic violations committed by university students over the past year.

Violations | Frequency | Relative Frequency | Percentage
Speeding | 65 | 65/200 = 0.325 | 0.325 × 100 = 32.5%
Using Cell Phones | 36 | 36/200 = 0.18 | 0.18 × 100 = 18%
No Signals | 32 | 32/200 = 0.16 | 0.16 × 100 = 16%
Tailgating | 25 | 25/200 = 0.125 | 0.125 × 100 = 12.5%
Slow speed | 25 | 25/200 = 0.125 | 0.125 × 100 = 12.5%
Bright Lights | 8 | 8/200 = 0.04 | 0.04 × 100 = 4%
Using Two Parking Spots | 9 | 9/200 = 0.045 | 0.045 × 100 = 4.5%
Total | 200 | 200/200 = 1 | 100%

Before we proceed, how can the data on traffic violations help in designing a new awareness campaign on safe driving within the university? Let's revisit the PPDAC cycle, which serves as the main process for guiding this analysis.

Figure 1.1: PPDAC for safe driving campaign

You can use the relative frequency column to make the following observations:
- Speeding was the most frequent violation committed by drivers.
- The second most common traffic violation was using a phone while driving.
- Tailgating and slow speed account for the same proportion of violations.
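The table in Example 1 can also be reproduced with a short script. The following is a minimal Python sketch, not part of the original notes: the function name `frequency_table` and the variable names are illustrative assumptions, and only the seven frequencies are taken from the example.

```python
# Frequencies copied from the Example 1 table (traffic violations).
violations = {
    "Speeding": 65,
    "Using Cell Phones": 36,
    "No Signals": 32,
    "Tailgating": 25,
    "Slow speed": 25,
    "Bright Lights": 8,
    "Using Two Parking Spots": 9,
}


def frequency_table(counts):
    """Return (category, frequency, relative frequency, percentage) rows."""
    total = sum(counts.values())
    rows = []
    for category, freq in counts.items():
        rel = freq / total  # relative frequency = frequency / total observations
        rows.append((category, freq, rel, rel * 100))
    return rows


table = frequency_table(violations)

# Order the classes from highest to lowest frequency (one of the orderings
# the notes allow) so the leading categories come first.
ranked = sorted(table, key=lambda row: row[1], reverse=True)

for category, freq, rel, pct in ranked:
    print(f"{category.ljust(25)}{freq:>5}{rel:>8.3f}{pct:>7.1f}%")

# The optional total row doubles as a check on the calculations.
total_freq = sum(row[1] for row in table)
total_rel = sum(row[2] for row in table)
print(f"{'Total'.ljust(25)}{total_freq:>5}{total_rel:>8.3f}{total_rel * 100:>7.1f}%")
```

Sorting by frequency puts Speeding and Using Cell Phones at the top, which are the two categories the observations single out.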
Thus, from this simple collation, a preliminary conclusion is that the campaign should focus on two major issues: speeding, which is the most frequent violation, and using cell phones while driving, which ranks second.

Try it Yourself 1
You are tasked with analyzing tourism data, using the PPDAC cycle, to create a promotional poster highlighting key destinations in the UAE. A systematic random sample of visitors to the UAE during the last month was obtained, and the favorite destination of each visitor is given in the following table:

Malls, Al Ain Zoo, Louvre Museum, Al Ain Zoo, Sir Baniyas Island, Al Ain Zoo, Malls, Qasr Al Watan, Yas Island, Louvre Museum, Louvre Museum, Qasr Al Watan, Malls, Sir Baniyas Island, Louvre Museum, Qasr Al Watan, Yas Island, Qasr Al Watan, Yas Island, Yas Island, Louvre Museum, Qasr Al Watan, Malls, Louvre Museum, Sir Baniyas Island

Based on the analysis, draw conclusions about which attractions should be prominently featured in the poster. You might consider focusing on the destinations with the highest frequencies.

Destinations | Frequency | Relative Frequency
 | |
Total | |

Give a few observations on the table above.
-
-
-

Suggest a Poster
Present your findings in an engaging way. Consider including visuals, like various photos or highlights of key locations.

1.1.2 Bar Graphs

As an alternative to a table, we can easily prepare a bar graph. Various software packages can create a plot that presents the data. The bar graph (or bar chart) is an important statistical tool that is widely used to understand and present data in media and research. You will learn how to read and interpret bar graphs with a clear understanding.

Example 2
Bayanat.ae published a dataset on blood unit usage by blood group. It is more useful and informative to present it as a bar graph. The blood types are presented on the horizontal axis (x-axis), and each bar's height corresponds to the number of units used in public hospitals.
We can see from this graph that blood type O+ is used significantly more than the other blood types.

When could you use a bar graph?

We begin with an exploration of some common use cases for bar charts. There are also many types of bar charts, demonstrating the versatility of this chart type, and it is important to choose the right bar chart for the right context.

1. Comparisons between categories or classes: Bar graphs are particularly effective when comparing different categories. Our visual perception is highly attuned to comparing the lengths of bars, especially when they share a common baseline, making this a great choice for showing differences between groups.
2. Ranking: Bar graphs can be used to display data in a ranked order, either from highest to lowest or vice versa. Sorting the bars in this way helps emphasize the largest or smallest values in your data, making it easy to identify top performers or outliers.
3. Grouping of quantitative variables: When working with quantitative data, you can group the values into distinct categories. For example, aggregating data by time periods (e.g., dividing data by quarters like 2024-Q1, 2024-Q2, etc.) allows you to represent trends over time while maintaining clear separation between the groups.

Example 3
In 2022, close to 700,000 pupils from 81 OECD Member and partner economies, representing 29 million pupils across the world, took the PISA test. PISA scores act as a metric to compare quality, equity and efficiency in learning outcomes across countries.

Figure 1.2: The best performing countries and economies

Findings
The graph shows uniform performance among the top 10 countries in the PISA test. The data is arranged in descending order, from highest to lowest.

Horizontal bar chart

The horizontal bar chart is a variation of the bar graph that uses horizontal bars to represent quantitative variables. The length of each bar is proportional to the value it measures.
In this chart, the categorical variable is plotted on the vertical axis (y-axis), and the numerical variable is on the horizontal axis (x-axis). This format is ideal for displaying long category labels and showing rankings, as it avoids the label overlap that can occur in vertical bar charts. While vertical bar charts are often the default, horizontal bar charts are more effective when dealing with lengthy labels, as they maintain clarity and readability without the need for rotation or shifting.

Example 4
World Population Review reports the average school day length by country for 2024. The length of the school day varies significantly across countries, ranging from 4.5 to 10 hours. Understanding these differences is essential, as it sheds light on the diverse educational practices and cultures worldwide.

Figure 1.3: Average School Day Length (Hours)

Findings
These figures represent the average start and end times and the total duration of the school day, excluding lunch, recess, and other breaks that form an integral part of the students' daily routine. This variation in school day lengths underscores the diversity in educational approaches and priorities across different countries.

Best practices for using bar charts

Maintain rectangular forms for your bars
It's important to avoid altering the shape of the bars in ways that can confuse the reader. For example, rounding the tops of bars, rather than keeping them flat, makes it difficult to determine the exact value. The viewer might not know whether to read the value from the top of the rounded cap or somewhere lower. Similarly, avoid using 3D effects, as these distortions can make it hard to measure bar lengths and may cause baselines to appear misaligned, further confusing the viewer.

Figure 1.4: Keep Columns in 2D Format and Flat

Double Bar Chart

A side-by-side (double) bar chart may be used to compare categorical data obtained from two (or more) different sources or groups.
Example 5: Double Bar Chart
This chart is taken from the 27th Annual CEO Survey: Middle East findings report.

Findings
The Middle East shows higher expectations for the positive impact of generative AI than the global average across all categories. The graph shows that CEOs in the Middle East are much more optimistic than their global counterparts about the potential of GenAI to improve efficiency at work. Specifically, 77% of Middle East CEOs believe that GenAI can improve their work efficiency, compared with just 59% globally.

Using Technology
How to draw bar graphs using Excel:
1. Select the range A1:A7, hold down CTRL, and select the range C1:D7.
2. On the Insert tab, in the Charts group, click the Column symbol.
3. Click Clustered Column.
4. Result:
5. Add Axis Titles for clarity.
6. Adjust the colors and the layout to make your graph clearer.

Case Study 1: Blood Donation Practices
In 2024, a group of researchers conducted the study "Blood Donation Practices and Awareness of Blood Types Among Adults in the United Arab Emirates: A Cross-Sectional Community-Based Study". The aim of this study is to evaluate the effectiveness of existing initiatives and propose targeted strategies to recruit new donors and retain existing ones, with a specific emphasis on less common or critical blood types like O-negative. A total of 259 participants were interviewed. The distribution of blood types, as revealed by the blood bank's lab analysis, is presented in the following table:

Blood Type | Frequency | Percentage
A+ | 58 | 22.4%
A- | 10 | 3.9%
B+ | 51 | 19.7%
B- | 9 | 3.5%
AB+ | 16 | 6.2%
AB- | 1 | 0.4%
O+ | 108 | 41.7%
O- | 6 | 2.3%
Total | 259 | ≈ 100%

1. Provide an appropriate graph for the data.
2. Since the aim is to propose strategies to recruit new donors and retain existing ones, with a specific emphasis on less common or critical blood types like AB- and O-, the graph will work nicely when presented to the public to encourage individuals to donate more often.
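The "≈ 100%" in the total row of the case-study table comes from rounding each percentage to one decimal place. A small Python sketch, using only the frequencies from the table (the variable names are illustrative assumptions), shows the effect:

```python
# Frequencies copied from the Case Study 1 table.
blood_types = {
    "A+": 58, "A-": 10, "B+": 51, "B-": 9,
    "AB+": 16, "AB-": 1, "O+": 108, "O-": 6,
}

total = sum(blood_types.values())

# Percentage of each blood type, rounded to one decimal place as in the table.
percentages = {bt: round(100 * n / total, 1) for bt, n in blood_types.items()}

# Sum of the rounded percentages versus the sum of the unrounded ones.
rounded_sum = round(sum(percentages.values()), 1)
exact_sum = sum(100 * n / total for n in blood_types.values())

for bt, pct in percentages.items():
    print(f"{bt.ljust(4)}{blood_types[bt]:>5}{pct:>7.1f}%")
print(f"Rounded percentages add up to {rounded_sum}%, unrounded to {exact_sum:.1f}%")
```

The rounded column adds up to 100.1% rather than exactly 100%, which is the kind of minor rounding error the frequency-table notes mention.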
Try it Yourself 2
Prevalence of Obesity in the UAE 2017-2023, published by bayanat.ae. The dataset reports the percentage of the adult population with a body mass index of 30 kg/m² or higher in the United Arab Emirates. The dataset is classified by gender, Emirate, nationality, and age group. The data covers the period from 2012 to 2023.

Row Labels | Females | Males | Grand Total
18-29 | 25% | 23% | 24%
30-44 | 40% | 31% | 35%
45-59 | 47% | 33% | 40%
60+ | 63% | 29% | 46%
All Ages | 37% | 29% | 33%
Grand Total | 42% | 29% | 36%

Table 1.1: Average of Value by Age Group and Gender

a. Write a purposeful question based on this data.
b. Give an answer to your question.
c. Graph the data to show your reasoning.

Try it Yourself 3
The graph below describes full-year GDP growth in the United Arab Emirates. Based on the graph, answer the following:
1. How would you describe the average GDP in 2023 compared to 2022?
2. How would you describe the average GDP in 2020?
3. Describe the overall trend of the GDP over the period 2012-2023.

1.1.3 Pie Charts

Pie charts are often used in research and media when the goal is to show how a whole is divided into its component parts. They are particularly useful when the relative size or proportion of categories needs to be quickly and visually understood. Unlike bar graphs, which are better for comparing individual categories, pie charts provide an immediate sense of how each part contributes to the whole.

Definition 1.2: Pie Chart
A pie chart is another graphical representation of a frequency distribution for nominal categorical data. A pie chart shows how a whole is divided into parts and can use either proportions or percentages.

Why Use Pie Charts Over Bar Graphs:
Proportional Focus: Pie charts emphasize proportions rather than exact numbers, making them ideal for showing percentages or shares of a total, like revenue share, population breakdown, or budget allocations.
Figure 1.5: Google Revenue in 2024-Q2

Example 6: Land Use In UAE by Crop Type, 2022
According to open data on bayanat.ae, the dataset below shows irrigated land use (unit: donum) in the United Arab Emirates classified by crop type, for the year 2022.

Year | Item | Value
2022 | Fruit trees | 413196.8
2022 | Vegetables | 64913.6
2022 | Field Crops & Fodders | 99644.5
2022 | Forest trees | 23976.5
2022 | Other land use | 186391.6
2022 | Fallow | 339282.2
2022 | Total | 1127405.2

A pie chart is more useful in this scenario than a bar graph for showing the relative proportions of land use in an easily understandable and visually appealing way.

Some comments on the pie chart:
- The pie chart shows that fruit trees constitute the largest portion of land use by crop in the UAE, accounting for 37%.
- Vegetables and field crops together make up only 15% of the land use (6% for vegetables and 9% for field crops).
- It's notable that 30% of the land is fallow. This can be compared with other years' statistics to study the progress of the land use.
- Diverse land use: with about 16.5% of the land dedicated to other uses and 2% to forest trees, the pie chart reflects a diverse approach to land management.

Example 7: Pie Chart in Media
The chart shows a breakdown of different causes of death related to overweight or obesity, such as coronary heart disease, stroke, diabetes, neoplasms (cancer/tumour), and other chronic noncommunicable diseases. Each segment of the chart represents a specific cause of death and is labeled with the number of deaths associated with that cause.

Source: IHME (Institute for Health Metrics and Evaluation), 2024.

Remark
The pie chart could be improved by including the percentages along with the frequency of each category. Percentages make it easier for viewers to compare the relative sizes of the segments, because viewers don't need to perform mental calculations to understand the proportions.
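The remark's suggestion, showing percentages next to the raw values, can be illustrated on the land-use data of Example 6. The sketch below is a hedged illustration rather than the method used to draw the chart in the notes: the function name `pie_slices` is an assumption, and the sector angle is simply each share of the full 360-degree circle.

```python
# Irrigated land use in donum, copied from the Example 6 table (2022).
land_use = {
    "Fruit trees": 413196.8,
    "Vegetables": 64913.6,
    "Field Crops & Fodders": 99644.5,
    "Forest trees": 23976.5,
    "Other land use": 186391.6,
    "Fallow": 339282.2,
}


def pie_slices(values):
    """Return (label, value, percentage, sector angle in degrees) per category."""
    total = sum(values.values())
    slices = []
    for label, value in values.items():
        share = value / total  # proportion of the whole
        slices.append((label, value, share * 100, share * 360))
    return slices


slices = pie_slices(land_use)
for label, value, pct, angle in slices:
    print(f"{label.ljust(24)}{value:>12,.1f}{pct:>7.1f}%{angle:>8.1f} deg")
```

Printing one decimal place keeps the labels summing to 100%; rounding every slice to a whole percent can push the total slightly above or below 100.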
Try it Yourself 4
Pie charts and bar graphs can show the distribution (either counts or percentages) of the highest level of education among people aged 25 or over.

Figure 1.6: Pie chart and bar graph of the distribution of highest level of education among persons aged 25 years and over in 2017.

- What is the type of the variable?
- How many possible values does the variable have?
- What percentage of people aged 25 or over have a bachelor's degree but not an advanced degree?
- Give a short description of both graphs.

Example 8: Arab Youth Survey 2023: Social media causing decline in mental health
Most young people in the Arab world believe that social media is having a negative effect on their mental health. This was one of the key findings in the latest Arab Youth Survey 2023, which revealed that more than 60 percent of young Arabs think social media use is contributing to a decline in their mental well-being. The figure below shows the percentages of respondents asked whether they agree or disagree with the statement: 'I share news on social media without checking its accuracy.'

Figure 1.7: Results of the Arab Youth Survey 2023: Social media causing decline in mental health

Here are a few questions to help you gain insights from the information provided in the graph.

1. Which region has the highest percentage of respondents who strongly/somewhat agree with sharing news on social media without checking its accuracy?
Answer: The Gulf region has the highest percentage of respondents who strongly/somewhat agree, with 58%.
2. Compare the percentage of respondents who neither agree nor disagree across the four groups. Which region has the lowest percentage?
Answer: The Gulf region has the lowest percentage of respondents who neither agree nor disagree, with 21%, closely followed by North Africa with 22%. Among all respondents, it is 27%, while the Levant has 31%.
3. Which region has the highest percentage of respondents who neither agree nor disagree, and what does this indicate?
Answer: The Levant region has the highest percentage, at 31%, compared with 21% in the Gulf and 22% in North Africa. This indicates that respondents in the Levant are somewhat more likely to be uncertain about their stance on this issue.

Using Technology
Using Excel to draw a pie chart:
1. Select the range A1:D2.
2. On the Insert tab, in the Charts group, click the Pie symbol.
3. Click Pie.
Result:

1.1.4 Infographics

In this section, we will explore the power of infographics in data visualization. Research shows that people tend to remember information better when they see it, rather than just read or hear it. Infographics combine data with visuals, making complex information easier to understand and more memorable. To make data resonate with your audience, turn it into something they can not only see but also engage with; this is the essence of an effective infographic. Let's dive in and see how visuals can transform how we communicate data!

Definition 1.3: Infographic
An infographic is a way of communicating information graphically. It combines text, images, icons and data representations (graphs).

Why Are Infographics So Effective?
Infographics enhance audience engagement in media by providing a concise, visual overview of a topic, and they contribute to data visualization and storytelling by simplifying complex information and making complex topics easier to explain. The key design principles for creating effective infographics include good design, attractive colors, and a consistent flow of ideas. The cognitive benefits of using infographics in media communication include improved recall and comprehension of data, and greater levels of issue-relevant thinking.

Example 9: Teens, Social Media, and Privacy
What effects does social media activity have on teenagers?
In 2012, 802 teenagers aged 12-17 participated in the following survey and interviews, which were conducted in both English and Spanish between July 26 and September 30. Explore what teens choose to share, what they keep private, and with whom, in the following infographics.

Tools to Make Infographics
There are many online tools and software packages to help you make infographics. Some are template driven (e.g. Piktochart, Easel.ly), allowing you to work within an established "look and feel", and others are tools you are already familiar with (e.g. PowerPoint). Take a look at some tools on our scale from easy to more difficult to use, depending on your level of skill and interest:

Figure 1.8: An infographic illustrating how visual tools enhance data communication. Image source: Creating Infographics

But before you just load up some chart maker program, there are rules. Building infographics is an art these days, and if you want to make one that people will stop and read, you have to do it right.

Guidelines for Beginners to a Good Infographic
Once you have your data, know your goal, and have visualized your data, the last thing to do is assemble your infographic. This can be complicated and typically requires many elements. If you are choosing to design your own, however, there are a number of things you will want to keep in mind:
- Information flow
- Color
- Font
- Shapes

A guide for getting started with planning and designing infographics

Example 10
In 2018, The National published an article on a study in which Masdar Institute scientists Dr Sanaa Pirani and Dr Hassan Arafat monitored average food waste in 45 hotels. Just 46 percent of lunch buffet food was eaten, and 53 percent of food served at iftars. Here are the rest of the findings.

Figure 1.9: Source: Journal of Cleaner Production

Avoid these Common Mistakes with Infographics

1. Make one big point
Infographics are great for communicating complex information, but that doesn't mean they should be overly complex themselves.
When the material is complex, it is wise not to oversimplify it, but it is critical to at least clarify it.

Figure 1.10: Confusing Infographic, Danny Dorling

2. Use a clear title and labels
Keep titles succinct and try to use words that are easy for your audience to understand. If you can announce the takeaway from the graph in the title, that really helps the audience.

Figure 1.11: Infographic with No Clear Title, CDC Twitter Account

3. Overall review
Take a final look at the bar graph and ask: do the percentages match up to the size of the bars? Did anything go wrong here? In an online report published on CNN.com, this graph was used in an article about how the Girl Scouts are going online to sell cookies.

Figure 1.12: Misleading Infographic, CNN.com

Note how the percentages do not match the bar sizes, which makes the graph misleading.

Example 11
As we know, water covers about 70% of Earth's surface, but in terms of volume, it represents only 1/1000th of the planet's total volume. This infographic from the US Geological Survey visually demonstrates what would happen if all of Earth's water, from oceans and ice caps to lakes and atmospheric vapor, were combined into a single sphere. Additionally, 97.5% of this water is saltwater, leaving just 2.5% as freshwater, which is crucial for life on Earth.

Figure 1.13: The World's Water, World Bank

Try it Yourself 5
This infographic from the Dubai Statistics Center illustrates Dubai's economic performance by sector, comparing growth rates and GDP contributions for the first quarter of 2019 and 2020.

Figure 1.14: Infographic Dubai Economic Performance

The infographic also compares GDP contributions. Answer the following based on the infographic:
- Identify two sectors that maintained a consistent GDP contribution between 2019 and 2020.
- Compare the growth rates for the financial and insurance sector in 2019 and 2020. What difference can you observe?
- Which sector experienced the largest decrease in growth rate between 2019 and the first quarter of 2020, and by how much?

Try it Yourself 6
This infographic presents a detailed breakdown of concerns related to COVID-19 in the workplace, providing data through pie charts and bar graphs. Interpret the charts and graphs:
- What percentage of employees selected anxiety as the emotion they felt most strongly during the last month?
- What percentage of employees aged 30-45 are at least somewhat concerned about the return of COVID-19 later this year?
- Which age group shows the highest percentage of being very concerned about the return of COVID-19 later this year?
- How does the percentage of employees with a household member considered at higher risk vary across different income levels?
- Compare the levels of concern about the return of COVID-19 between employees aged 18-29 and those aged 65+. What do you notice?

Try it Yourself 7
List all the issues you can identify in the following graph.

Figure 1.15: Misleading Graph, Fox News

1.2 Organize Quantitative Data

Quantitative data deals with numerical values and can be represented using graphs that highlight distribution, trends, and variability. These tools provide insights into the shape, center, spread, and outliers of the data. These visualizations help to uncover relationships and patterns that might be hidden in raw numbers, making them crucial for effective data analysis and interpretation. We will learn about some common tools for summarizing and visualizing quantitative data:
- Frequency distributions
- Histogram
- Line Plot
- Radar Plot

1.2.1 Frequency distributions

Remember: A frequency distribution is a table that partitions data into classes or intervals of equal width and shows how many data values are in each class. The frequency, f, of a class is the number of data entries in the class.

Constructing a Frequency Distribution:
1. Determine the number of non-overlapping classes.
- Use between 5 and 15 classes.
- Datasets with a larger number of values usually require a larger number of classes.
2. Determine the width of each class.

$$\text{Approximate Class Width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$$

3. Determine the class limits (boundaries).
- Class limits must be chosen so that each data item belongs to one and only one class (avoid overlapping or unclear class limits).
4. The analyst uses judgment to determine the combination of the number of classes and class width that provides the best frequency distribution for summarizing the data.

Example 12: Computer Repair
The price of computer repair can vary greatly by region and repair shop. To hire a computer repair technician to get you back up and running, you will likely spend between 50 and 110 Dhs. The following sample data set lists the repair price per hour (in Dhs) for 50 repair shops in the UAE.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

Construct a frequency distribution using 6 classes.
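Steps 1 to 3 above can be checked on this sample with a short script. The following is a minimal Python sketch under two stated assumptions that the notes leave to the analyst's judgment: the approximate width is rounded up to a whole number, and the first class starts at the multiple of the width at or below the smallest value. The variable names are illustrative.

```python
import math

# Repair price per hour (Dhs) for the 50 shops listed in Example 12.
prices = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

num_classes = 6  # Step 1: number of classes, as given in the example

# Step 2: approximate width = (largest - smallest) / number of classes.
raw_width = (max(prices) - min(prices)) / num_classes  # (109 - 52) / 6 = 9.5
width = math.ceil(raw_width)  # assumption: round up to a whole number (10)

# Step 3: assumption - start the first class at the multiple of the width
# at or below the smallest value (50 here), so the limits are round numbers.
start = (min(prices) // width) * width

# Each value belongs to exactly one class: lower limit included, upper excluded.
counts = [0] * num_classes
for price in prices:
    counts[(price - start) // width] += 1

total = len(prices)
for i, freq in enumerate(counts):
    lower = start + i * width
    print(f"{lower} but less than {lower + width}: "
          f"{freq:>2}  {freq / total:.2f}  {100 * freq / total:.0f}%")
```

With these choices the classes run from 50 to less than 110 in steps of 10, and the counts can be compared with the worked solution that follows.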
- Sort the data in ascending order:

52 57 62 62 62 62 65 66 67 68
68 68 69 69 69 71 71 72 72 73
74 74 75 75 75 76 77 78 79 79
79 80 80 82 83 85 88 89 91 93
97 97 97 98 99 101 104 105 105 109

- Determine the class width:

$$\text{Class width} = \frac{109 - 52}{6} = 9.5 \approx 10$$

- Determine class boundaries and assign observations to classes:

Repair Cost (Dhs) | Frequency
50 but less than 60 | 2
60 but less than 70 | 13
70 but less than 80 | 16
80 but less than 90 | 7
90 but less than 100 | 7
100 but less than 110 | 5
Total | 50

- Calculate the relative frequency and the percentage of each class:

Repair Cost (Dhs) | Frequency | Relative Frequency | Percent Frequency
50 but less than 60 | 2 | 0.04 | 4%
60 but less than 70 | 13 | 0.26 | 26%
70 but less than 80 | 16 | 0.32 | 32%
80 but less than 90 | 7 | 0.14 | 14%
90 but less than 100 | 7 | 0.14 | 14%
100 but less than 110 | 5 | 0.1 | 10%
Total | 50 | 1 | 100%

Insights gained from the percent frequency distribution:
- Only 4% of the repair costs are in the 50 to less than 60 Dhs class.
- 30% of the repair costs are under 70 Dhs.
- The greatest percentage (32%, or almost one-third) of the repair costs are in the 70 to less than 80 Dhs class.
- 10% of the repair costs are 100 Dhs or more.

Try it Yourself 8
A sample of 400 college freshmen was asked how many hours per week they spend on social media platforms. The following frequency distribution presents the results.

Number of hours | Frequency
1.0-3.9 | 50
4.0-6.9 | 68
7.0-9.9 | 106
10.0-12.9 | 78
13.0-15.9 | 63
16.0-18.9 | 35

a. Construct a relative frequency distribution.
b. How many students spend 10 to less than 16 hours per week on social media platforms?
c. What percentage of students spend less than 10 hours per week on social media platforms?
d. What percentage of students spend 13 or more hours per week?

1.2.2 Histogram

- A histogram is a graphical display that gives an idea of the shape of the distribution of quantitative data, much as a bar chart shows the distribution of qualitative data.
- The class boundaries (or class midpoints) are shown on the horizontal axis while frequency is measured on the vertical axis. - Bars of the appropriate heights can be used to represent the class frequency (or percent). - Unlike a bar chart, a histogram has no natural separation between rectangles of adjacent classes. A space indicates that there are no observations in that interval. - Look for: Central or typical value, extent of spread or variation, general shape, location and number of peaks, presence of gaps and outliers (unusual extreme points). Describing Histograms: Shape, Center, Variability, and Outliers Histograms are a powerful way to visualize the distribution of data. They allow us to quickly understand the shape, center, variability, and presence of any outliers in the dataset. Let’s explore each of these aspects in detail. Shape The shape of a histogram describes how data is distributed across different values. Common Shapes: - Symmetrical: Data is evenly distributed around a central value. A classic example is the bell-shaped curve, known as a normal distribution. - Skewed Right (Positively Skewed): Most data points are on the left with a long tail extending to the right. This suggests that extreme high values are pulling the distribution. - Skewed Left (Negatively Skewed): Most data points are on the right with a long tail extending to the left, indicating that there are fewer low values pulling the distribution. - Uniform: Data is evenly distributed across all values, giving a flat shape. - Bimodal: The histogram has two peaks or modes, indicating two prevalent ranges of values in the data. Common Distribution Shapes 29 The shape gives insight into the underlying distribution and can help identify patterns like skewness or whether the data has multiple modes. Center The center of a histogram represents a typical value or the ”middle” of the data distribution. Ways to Measure Center: - Mean: The average of the data. 
It can be pulled in the direction of any skewness or outliers.
- Median: The middle value of the data when arranged in order. Unlike the mean, the median is less affected by outliers or skewness.

Remark
More about measures of center will come in Unit 4.

Variability (Spread)
Variability refers to how spread out the data is in the histogram. A wider spread indicates greater variability, while a narrower spread suggests the data is more concentrated around the center.
Measures of Variability:
- Range: The difference between the highest and lowest values. It is a simple measure of spread.
Interpretation: Look at how far the data points are from the center. If most data points are clustered near the center, the variability is low. If the data is more dispersed, the variability is high.

Outliers
Outliers are data points that are significantly higher or lower than the rest of the data. In a histogram, they appear as bars that are separate from the main body of the data.
Identifying Outliers: Look for bars that are distant from the bulk of the data.
Impact of Outliers: Outliers can distort summary measures, particularly the mean and standard deviation. In decision-making, it is essential to determine whether these outliers are the result of measurement errors or reflect real variability in the data.

Tips for Describing Histograms
When describing a histogram, be sure to mention:
- The shape: Is it symmetrical, skewed, or bimodal?
- The center: Where is the middle of the data? Does the median or the mean better represent it?
- The variability: Is the data spread out or clustered around the center?
- The presence of any outliers: Are there any extreme values that stand apart from the rest of the data?

Example 13: Computer Prices
To follow up on our previous example, we constructed the following histogram based on the sample data set.
Interpretation: Here is how we would interpret the histogram in terms of our four characteristics:
Center: The costs per hour of repair are around 80 AED.
Variability: The costs range between about 50 AED and 110 AED.
Shape: The shape is almost symmetric.
Outliers: There are no obvious outliers.

Try it Yourself 9
The Wechsler Adult Intelligence Scale (WAIS) is an IQ test. A recent version of this test was administered to a sample of n = 2200 Americans (aged 16-90). The histogram of the simulated IQ score data is shown in the figure. How would you describe the histogram?
Center:
Variability:
Shape:
Outliers:

Try it Yourself 10
The following figure displays the long-jump distances (in meters) for 40 male athletes participating in the 2012 Summer Olympics in London. There were 42 male athletes, but 2 were disqualified. The longest jump for each athlete is shown. How would you describe the histogram?
Center:
Variability:
Shape:
Deviations:

Exercise
1. The following histogram shows the distribution of serum cholesterol level (in milligrams per deciliter) for a sample of 200 men. Use the histogram to answer the following questions:
a. What percentage of men have cholesterol levels above 240?
b. In which interval are there more men: 240–260 or 280–340?
c. How many men have cholesterol levels between 260 and 280?
d. Is the histogram most accurately described as skewed to the right, skewed to the left, or approximately symmetric?
2. The graphs below show the distribution of first-time mothers' ages, as percentages of all mothers.
a. For each of the histograms, what do you notice? Make a list of 3 detailed observations for each graph (1980, 2016).
b. For these histograms, what is the data showing? For example, what questions do you have about these graphs? What are you curious about that comes from what you notice about these graphs?

1.2.3 Line plot
Having a well-designed research question is a critical beginning to any data-driven research problem.
While an in-depth discussion of how research questions can be designed is beyond the scope of this course, the following table gives a few examples and provides some insight into the considerations and desirable features of good research questions.

When should you use a line chart?
Use a line chart when you want to emphasize changes in the values of one variable (plotted on the vertical axis) for continuous values of a second variable (plotted on the horizontal axis). This emphasis on patterns of change is conveyed by line segments moving consistently from left to right, with slopes that move up or down. The independent variable is listed along the horizontal axis, or x-axis, and the quantity or value of the data is listed along the vertical axis, or y-axis.
A common use of the line chart is to observe quantitative data that are measured at regular intervals over a period of time. Data that are observed over time are called time series data.

Remark: In general, look for these three characteristics when examining data over time in line graphs.
- Overall trend. A trend is a long-term increase or decrease over time.
- Seasonal variation. These are patterns that repeat themselves over time, e.g., each week, each month, or each year.
- Sharp deviations. These are unusual observations that deviate greatly from the overall pattern.

Example 14
Below are the numbers of students registered in UAE private and public schools, according to MOE Open Data.

  Academic Year   Total
  2017-18           254,434
  2018-19           226,279
  2019-20           509,657
  2020-21           790,292
  2021-22           929,395
  2022-23         1,294,802
  2023-24           755,020

Analysis: Overall, the number of students registered in UAE schools increased over the seven academic years shown, though not steadily. After a dip in 2018-19, the number more than doubled in 2019-20 and kept increasing until 2022-23, when it reached about 1.29 million students. In the last academic year shown (2023-24), the number dropped to around 755,000 students.
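The year-to-year changes behind an analysis like this can be checked with a short calculation. The following is a minimal sketch in Python, using only the totals from the Example 14 table; the function name `year_over_year` is an illustrative choice, not part of the course material.

```python
# Enrollment totals copied from the Example 14 table.
enrollment = [
    ("2017-18", 254_434),
    ("2018-19", 226_279),
    ("2019-20", 509_657),
    ("2020-21", 790_292),
    ("2021-22", 929_395),
    ("2022-23", 1_294_802),
    ("2023-24", 755_020),
]


def year_over_year(rows):
    """Return (year, absolute change, percent change) for each year after the first."""
    changes = []
    for (_, previous), (year, current) in zip(rows, rows[1:]):
        difference = current - previous
        percent = 100 * difference / previous
        changes.append((year, difference, round(percent, 1)))
    return changes


changes = year_over_year(enrollment)
for year, difference, percent in changes:
    # A positive sign marks an increase, a negative sign a decrease.
    print(f"{year}: {difference:+,} students ({percent:+.1f}%)")
```

With the figures in the table, the largest percentage rise falls in 2019-20, and the two decreases fall in 2018-19 and 2023-24. These are the "sharp deviations" to look for when reading the line graph.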
Try it Yourself 11: Occurrences of Earthquakes of 2000 to 2023
In October 2021, the UAE committed to reaching net zero emissions by 2050 through its Net Zero 2050 strategy. In January 2024, the UAE confirmed this goal to the UN, outlining plans for reducing and removing emissions, along with pathways for different sectors to achieve net zero.
Describe the overall trend of CO2 emissions over the past 30 years.

Multiple Line Graph
One way to display data is in a multiple line graph. A multiple line graph shows the relationship between independent and dependent values for multiple sets of data. Multiple line graphs are usually used to show trends over time.

Example 15: UAE business activity growth diverges from global trend in 2022
According to a report, the UAE non-oil sector has seen a robust acceleration in business activity growth since the height of the COVID-19 pandemic.
Findings
Significant growth occurred during Expo 2020, held between October 2021 and March 2022, which had a considerable impact on tourism and new business. The sustained momentum suggests that firms continue to benefit from increased investment and higher economic activity.
Another finding is that the growth in the non-oil sector contrasts with a faltering global economy. The UAE business activity index clearly outperformed in 2022, indicating that the domestic non-oil sector has been more resilient to external shocks than those of most other countries.

Case Study 2: Line Chart in Research
A paper in the journal Renewable Energy discusses the costs and benefits of large-scale solar photovoltaic power production in Abu Dhabi, United Arab Emirates. The research looks at the costs and benefits of building a 10 MW solar power plant in Abu Dhabi. It compares the city's highest monthly electricity demand with how much electricity the solar plant is expected to produce.
This plot compares the monthly peak electricity demand (in megawatts, MW) and the projected electricity production.
Based on the chart above, answer the following questions:
- In which months does the demand surpass the production?
- In which month does the peak electricity export occur?
- According to the article, what could be the reasons behind the decrease in electricity production from May to August?

1.2.4 Radar plot
Definition 1.4: Radar Plot
A radar chart, also known as a spiderweb chart, is a quick way to compare different data series. The series are overlaid on axes that radiate from the same point, and transparent shades and patterns highlight the contrast for the reader. Radar charts can be a useful alternative to bar charts because they show several variables and several series in a single two-dimensional graph.

A radar chart has a few basic elements.
- Center point: The central point of a radar chart, from which the axes radiate.
- Axis: Each axis represents a variable in a radar chart. A radar chart needs at least three axes.
- Grids: When the axes are linked in a radar chart, they divide the graph into grids that help us read the values.
- Values: Each data series is drawn in its own color, with its value marked on each axis.
Once the axes are drawn, we mark the values on each axis and plot the chart for every entry, giving each entry a distinctive color.

When to use a radar chart
Radar charts are particularly useful when:
1. You want to compare multiple data sets or individuals across a set of variables.
2. The variables are measured on a similar scale (e.g., percentages, ratings).
3. You want to visualize the overall profile or pattern of data points rather than specific values.
4. You need to make comparisons across products or services.
5. You want to identify outliers or extreme values.
6. You want to present data in a visually appealing and easy-to-understand format.
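To make the elements of a radar chart concrete, here is a minimal sketch in Python of how the points of one data series are placed. It assumes the axes are spaced at equal angles around the center point and that each value is plotted at that distance along its axis; the starting direction of the first axis, the variable names, and the ratings are hypothetical choices for illustration only.

```python
import math


def radar_points(values):
    """Return the (x, y) corner points of a radar polygon for one data series.

    Axis k of n is drawn at angle 2*pi*k/n, measured from the positive x-axis.
    The value on that axis is its distance from the center point (0, 0).
    """
    n = len(values)
    if n < 3:
        raise ValueError("a radar chart needs at least three axes")
    points = []
    for k, value in enumerate(values):
        angle = 2 * math.pi * k / n
        points.append((value * math.cos(angle), value * math.sin(angle)))
    return points


# Hypothetical ratings (1-5) on four variables, for illustration only.
variables = ["Punctuality", "Communication", "Technical knowledge", "Teamwork"]
ratings = [4, 3, 5, 2]

polygon = radar_points(ratings)
for name, (x, y) in zip(variables, polygon):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Joining the points in order, and closing the shape back to the first point, gives the polygon that a radar chart shades for one series. Charting tools such as Excel do this placement automatically; many of them start the first axis at the top rather than on the right.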
Limitations and when not to use a radar plot
A radar plot is not useful in these situations:
- Not suitable for large datasets: As the number of variables or categories increases, radar plots can become cluttered and hard to read.
- Not ideal for precise value comparisons: Other charts, such as bar plots or line charts, may be better when you need precise comparisons.
- Potential for misinterpretation: If the scales on different axes are not uniform, the plot can distort perceptions.

Example 16: Comparison between Female and Male Performances
Assume you are a manager in an IT company and you want to assess the overall performance of two employees. You can compare the performance of both employees using a radar plot. To compare them, you use variables such as punctuality, communication skills, technical knowledge, and teamwork. Ratings (1-5) are assigned to each of these categories for both employees, and the results are compared accordingly.

Try it Yourself 12: Students' Performance
Examine the radar plot comparing the performances of Student A and Student B across various subjects.
- List the subjects where Student A performs significantly better than Student B.
- List the subjects where Student B performs significantly better than Student A.
- Give a general comment on both students' performances.

Case Study 3: What the Language You Tweet Says About Your Occupation
Many aspects of people's lives are deeply connected to their jobs. In this paper, the researchers investigated the distinct characteristics of major occupation categories based on tweets.
Eight job categories are extracted: Marketing, Administrator, Start-up, Editor, Software Engineer, Public Relation, Office Clerk, and Designer.
The five personality dimensions are:
- Openness: measures how open a person is to unusual ideas, imagination, curiosity, and variety of experience. In other words, higher openness indicates a higher acceptance of new things and changes.
- Conscientiousness: reveals a person's self-discipline. People of higher conscientiousness tend to act in an organized and thoughtful way.
- Extraversion: indicates the extent to which a person prefers or enjoys social situations, interaction with the outside world, and the company of others.
- Agreeableness: reflects whether a person feels comfortable about compromising. Higher agreeableness implies being more cooperative toward others.
- Neuroticism: measures the instability of a person's emotions. A person of higher neuroticism usually experiences negative emotions more easily.

Using Technology: How to Create a Radar Chart in Excel
Step 1: Organize your data
Step 2: Select your data
Step 3: Insert the Radar chart
Step 4: Make sure the labels are clear

1.3 Graphical Displays in Media
There are many ways to present data in pictures. The most common are plots and graphs, but sometimes a unique picture is used to fit a particular situation. The purpose of a plot, graph, or picture of data is to give you a visual summary that is more informative than simply looking at a collection of numbers. Done well, a picture can quickly convey a message that would take you longer to find if you had to study the data on your own. Done poorly, a picture can mislead all but the most observant of readers.
Here are some basic characteristics that all plots, graphs, and pictures should exhibit:
- The data should stand out clearly from the background.
- There should be clear labeling that indicates
  - the title or purpose of the picture.
  - what each of the axes, bars, pie segments, and so on, denotes.
  - the scale of each axis, including starting points.
- A source should be given for the data.
- There should be as little "chart junk" (that is, extraneous material) in the picture as possible.

Example 17: What's Going On in This Graph?
For the last three years, The New York Times has collaborated with the American Statistical Association (A.S.A.)
to produce "What's Going On in This Graph?". The following graph presents the common injuries for boys in popular high school sports, showing the number of injuries per 10,000 competition plays. Note how the size of each circle is proportional to the number of injuries recorded on that body part. You can also note that head injuries are the most common type in all sports except basketball. What other insights can you draw from this graph?

Data Map
A data map is a visual representation that displays data geographically, using a map as the backdrop to show patterns, trends, and relationships in a specific location. This technique gives a clearer understanding of how data is distributed spatially, making it easier to analyze regional differences, identify hotspots, and gain insights that might be overlooked in non-geographic data presentations. Data maps are commonly used in fields like geography, economics, urban planning, and environmental science to help users make informed, location-based decisions.

Example 18: Country already reached SDG target on clean cooking fuels?
One of the targets of the UN Sustainable Development Goals (SDGs) is to ensure universal access to affordable, reliable, and modern energy services. Here, this is shown as the share of the population with access to clean fuels for cooking and heating.

Try it Yourself 13: Hotter Summer
The following two graphs were previously published in another article. They are two of the five graphs that display the Northern Hemisphere summer land temperatures for periods from 1950 to 2023.
How have average summer land temperatures across the Northern Hemisphere changed over the past 72 years?

Using Technology: www.gapminder.org/
Gapminder is a free tool designed to provide a fact-based worldview. It is used by millions of people, including teachers, journalists, and decision-makers worldwide. Watch this tutorial, which will show you how to use Gapminder.
Gapminder can show a single variable
Gapminder can animate two variables
For example, this tool will show the patterns between GDP per capita and life expectancy.
Gapminder teaches you a lot about the world
This website contains many videos that help you learn more from data. In this short video, Professor Hans Rosling shows that people live longer in countries with a high GDP per capita. No high-income countries have short life expectancy, and no low-income countries have long life expectancy. Still, there is a huge difference in life expectancy between countries on the same income level, depending on how the money is distributed and how it is used.

1.4 Graphs, Good and Bad
A number of common mistakes appear in plots and graphs that may mislead readers. If you are aware of them and watch for them, you will substantially reduce your chances of misreading a statistical picture.
The most common problems in plots, graphs, and pictures are
➣ No labeling on one or more axes
➣ Not starting at zero as a way to exaggerate trends
➣ Distorting time series plots
➣ Change(s) in labeling on one or more axes
➣ Misleading units of measurement
➣ Using poor information.

Example 19: Adjusting the labels
A car manufacturer's ad stated that approximately 98% of the vehicles it had sold in the past 10 years were still on the road. The ad then showed a graph similar to the one in Figure 1. The graph shows the percentage of the manufacturer's automobiles still on the road and the percentage of its competitors' automobiles still on the road. Is there a large difference?
Figure 1: Graph of the automaker's claim using a scale from 95% to 100%. The axis has been cut off and starts from 95%.
When the same data are redrawn using a scale that goes from 0 to 100%, as in Figure 2, there is hardly a noticeable difference in the percentages. Thus, changing the starting point of the y-axis can convey a very different visual representation of the data.
Moreover, the y-axis isn't labeled, which could mislead readers into interpreting the data as the number of vehicles.

Example 20: How is this misleading?
Figure 1.16: Misleading Pie Chart on US Smart Phone Market
During the Macworld 2008 keynote, Steve Jobs presented a pie chart showing the US smart phone market share. Do you recognize any issues in Steve's chart?
Remark: Wired Magazine published a justification for this issue, illustrating that it was a perspective issue, not an intentional one.

A Checklist for Statistical Pictures
To summarize, here are 12 questions you should ask when you look at a statistical picture, before you even begin to try to interpret the data displayed.
1. Does the message of interest stand out clearly?
2. Is the purpose or title of the picture evident?
3. Is a source given for the data, either with the picture or in an accompanying article?
4. Did the information in the picture come from a reliable, believable source?
5. Is everything clearly labeled, leaving no ambiguity?
6. Do the axes start at zero or not?
7. For time series data, is a long enough time period shown?
8. Can any observed trends be explained by another variable, such as increasing population?
9. Do the axes maintain a constant scale?
10. Are there any breaks in the numbers on the axes that may be easy to miss?
11. For financial data, have the numbers been adjusted for inflation and/or seasonally adjusted?
12. Is there information cluttering the picture or misleading the eye?

Example 21: Misleading Graphs in Data
On the Global Media Insight webpage, some graphs are not constructed properly and are misleading. For example, this one shows the population of major cities in Saudi Arabia. Can you identify the issue with this graph?
Figure 1.17: Misleading Chart on Population in Main Cities in Saudi Arabia