MODULE 5: DATA PRESENTATION


Learning Outcome: Create appropriate tabular and graphical displays using R to present and summarize data in a meaningful manner.

In this module, you will learn how to construct statistical tables and graphs that present collected data in a more meaningful and visual manner. Most of these can be produced in Microsoft Excel. However, we focus on the use of the R software in producing these graphs and charts. You are encouraged to also learn about other graphs and charts and how to create them using R and RStudio.

Data Presentation and Visualization

After the sampling and data collection process, what results is data in its raw form, which is often difficult to understand as is. The next step is to summarize and organize the data in textual, tabular or graphical form so that the researcher or author can impart useful information to the readers. In preparing texts, tables or graphs, we must always be mindful of what information the data convey and what must be done to include more useful information. Planning how the data will be presented is essential before processing the raw data.

Data visualization is the use of graphical displays to summarize and present information about a data set. Data become more comprehensible and more useful when they are organized and presented using graphs, frequency distribution tables, charts, diagrams and the like, from which logical conclusions can be drawn.

Summarizing Qualitative and Quantitative Data for a Single Variable

Data obtained from a single variable can be summarized and presented in many ways. A frequency distribution table, a bar chart and a pie chart can be used to present qualitative data. Quantitative data, on the other hand, can be summarized using a dot plot, a stem-and-leaf display, a frequency distribution table, and a histogram. Let us look at each of these methods more closely.
FREQUENCY DISTRIBUTION TABLE

A frequency distribution is a table that shows how often each value (or set of values) of the variable in question occurs in a data set. It is used to summarize categorical (qualitative) or numerical (quantitative) data. Simply put, it is a tabular summary of data showing the number or frequency of observations in each of several non-overlapping categories or classes.

Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited.

The relative frequency of a class equals the fraction or proportion of the observations belonging to that class or category. Thus, for a data set with n observations, the relative frequency can be computed using

$$\text{relative frequency of a class} = \frac{\text{frequency of the class}}{n}$$

A relative frequency distribution gives a tabular summary of data showing the relative frequency for each class. If the relative frequency is multiplied by 100, we get the percent frequency of a class. A percent frequency distribution summarizes the percent frequency of the data for each class.

Example 1: The raw data in the table below show fifty soft drink purchases. Notice that there is not much information that we can get from the data in its current form, so it is best to consider other ways to present the data. Let us construct a frequency distribution table for the sample.

Fig. 5.1. Sample Data on Soft Drink Purchases

The frequency distribution table for this data set can be constructed manually or by using the PivotTable feature of Microsoft Excel. With some editing, the following are the frequency, relative frequency and percent frequency tables generated:
Soft Drink Type    Frequency
Coke Classic       19
Diet Coke          8
Dr. Pepper         5
Pepsi              13
Sprite             5
Total              50

Table 5.1. Frequency Distribution Table for Soft Drink Purchases

Soft Drink Type    Relative Frequency
Coke Classic       0.38
Diet Coke          0.16
Dr. Pepper         0.10
Pepsi              0.26
Sprite             0.10
Total              1.00

Table 5.2. Relative Frequency Distribution Table for Soft Drink Purchases

Soft Drink Type    Percent Frequency
Coke Classic       38%
Diet Coke          16%
Dr. Pepper         10%
Pepsi              26%
Sprite             10%
Total              100%

Table 5.3. Percent Frequency Distribution Table for Soft Drink Purchases

Using RStudio, on the other hand, the task can be completed by running the following R code in the Console window. We will use the “purchase.csv” file in our working directory.

R Script

Be sure to have the readr and pander packages installed before running the following scripts.

Running the last script yields the following frequency table.

Coke Classic    Diet Coke    Dr. Pepper    Pepsi    Sprite
19              8            5             13       5

We now transform the preceding output table by means of the following scripts. We now have the frequency distribution table for the given data.

                Frequency
Coke Classic    19
Diet Coke       8
Dr. Pepper      5
Pepsi           13
Sprite          5

Table 5.4. Frequency distribution table of soft drink purchases

The same R code or script can also be written in the Source window or pane if you want to keep a copy of the scripts you write in RStudio.
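The following is a minimal sketch, not the module's original script, of how the frequency, relative frequency and percent frequency tables for Example 1 might be produced in base R. It assumes the fifty purchases are available as a character vector. Here the vector `drinks` is rebuilt from the counts in Table 5.1 rather than read from “purchase.csv”, whose column names are not shown, and the object names (`drinks`, `freq`, `rel_freq`, `pct_freq`, `freq_table`) are illustrative only.

```r
# Rebuild the 50 purchases from the counts in Table 5.1 (assumption: the
# real data would normally be read from "purchase.csv" instead).
drinks <- rep(
  c("Coke Classic", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"),
  times = c(19, 8, 5, 13, 5)
)

# Frequency distribution: number of purchases in each category
freq <- table(drinks)

# Relative frequency = class frequency / number of observations
rel_freq <- prop.table(freq)

# Percent frequency = relative frequency * 100
pct_freq <- 100 * rel_freq

# Combine the three summaries into one data frame for display
freq_table <- data.frame(
  Frequency = as.vector(freq),
  Relative  = as.vector(rel_freq),
  Percent   = as.vector(pct_freq),
  row.names = names(freq)
)
print(freq_table)
```

With the imported data, `drinks` would simply be the relevant column of the data frame. If the pander package is installed, passing `freq_table` to `pander()` should render it as a formatted table.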
First, we create a new R script file by clicking on the File menu, then New File, and selecting R Script. The same result can be obtained with the keyboard shortcut Ctrl+Shift+N. Write the R code in the Source window. You should have something similar to Figure 5.2 below.

Save the R script file. R script files are named with a .R extension. Click on the save icon in the Source window and browse to your working directory. Name the file purchase.R.

After saving the file, execute the script by highlighting all the lines in the Source window and then clicking on the ‘Run’ icon at the upper right of the Source window. As an alternative to the ‘Run’ icon, you can press Ctrl+Enter to run the script.

Figure 5.2. R script for the frequency distribution table for the soft drink purchase data.

For the relative frequency table, we can run the following R script.

R Script

                Relative Frequency
Coke Classic    0.38
Diet Coke       0.16
Dr. Pepper      0.1
Pepsi           0.26
Sprite          0.1

Table 5.5. Relative frequency distribution table of data on soft drink purchases.

Note that since the dataset was already imported in RStudio when we constructed the frequency distribution table, there is no need to import the data again.
Also, since the packages were already installed and loaded by the previous R scripts, there is no need to repeat these commands.

Example 2: A survey was taken on Aurora Avenue. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Table 5.6 shows the frequency, relative frequency and percent frequency for the data in just one table. Note that in practice, it is customary to include only one such type of frequency.

Number of cars    Frequency    Relative Frequency    Percent Frequency
0                 4            0.20                  20 %
1                 6            0.30                  30 %
2                 5            0.25                  25 %
3                 3            0.15                  15 %
4                 2            0.10                  10 %

Table 5.6. Frequency distribution table for the number of cars registered in each household

In this example, the frequency table constructed is for ungrouped data, which means that the individual values do not lose their identity in the table.

To do this in RStudio, let us consider a different approach: constructing a vector that holds the data values. Open a new R script file, then enter and run the following script.

R Script

     Frequency    Relative Frequency    Percent Frequency
0    4            0.2                   20
1    6            0.3                   30
2    5            0.25                  25
3    3            0.15                  15
4    2            0.1                   10

Table 5.7. Frequency distribution table for the number of cars registered in each household.

Let us now consider creating a grouped frequency distribution table. This table is usually constructed for large sets of quantitative data.

Grouped Frequency Distribution Table

Quantitative data can be grouped in class intervals (or intervals of values). We note, however, that the individual numerical values are lost in a grouped frequency distribution table. It is therefore not advisable to compute descriptive measures or to perform procedures for statistical inference from a grouped frequency distribution, since significant information is lost when the raw data values are classified into intervals. The grouped frequency distribution table is created simply as a convenient means of organizing and summarizing data.

The following are some important terms in relation to the creation of a grouped frequency distribution table.

1. Class Intervals (or Class Limits)
   The grouping of data defined by a lower limit and an upper limit.
2. Class Width (or Class Size)
   The difference between two adjacent lower limits or two adjacent upper limits.
3. Class Frequency
   The number of observations belonging to a class interval.

Steps in Constructing a Grouped Frequency Distribution Table

1. Determine the number of classes or the number of class intervals required. Frequency distribution tables usually have 6 to 20 classes. We can use Sturges’ formula to calculate the number of classes.
Sturges’ Formula:

$$k = 1 + 3.322 \log_{10} n$$

where k is the number of classes and n is the number of observations.

2. Calculate the class size or class width. The class size is the distance from the upper limit of one class to the upper limit of an adjacent class.

$$\text{class width} = \frac{\text{highest value} - \text{lowest value}}{k}$$

As a rule of thumb, if the result is not exact, we always round the class width up to the same number of decimal places as in the raw data.

3. Enumerate the class intervals.
   Lower limit (LL) of the first class interval = the lowest score (unless otherwise specified)
   LL of succeeding intervals = LL of preceding interval + class width
   Upper limit (UL) = value that comes before the succeeding lower limit = UL of preceding interval + class width

4. Tally the observations.

Let us take a look at Example 3 to illustrate the process of creating a grouped frequency distribution table.

Example 3: Consider the following data set on the monthly rent ($) for a sample of 70 one-bedroom apartments:

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Solution:

1. Determine the number of class intervals.

$$k = 1 + 3.322 \log_{10} 70 = 1 + 3.322(1.845) = 7.129 \approx 7$$

We create a frequency distribution table with 7 class intervals.

2. Compute the class width.

$$\text{class width} = \frac{615 - 425}{7} = \frac{190}{7} = 27.14 \approx 28$$

We then perform the rest of the steps (3 and 4) and come up with the following frequency distribution table for the given data. In this case, the individual values are grouped together in each class and are no longer visible.
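The arithmetic in steps 1 and 2 and the tally in step 4 can be checked with a short R sketch. This is a hedged illustration rather than the module's own script: the vector `rent` is rebuilt with rep() from the 70 values listed in Example 3 instead of being read from a file, and the names `values`, `counts`, `k`, `width`, `breaks` and `rent_class` are illustrative. The last break point works out to 621 (425 + 7 × 28), so that cut() with right = FALSE produces seven left-closed intervals of equal width.

```r
# The 70 monthly rents of Example 3, written as the distinct values and how
# often each one occurs (equivalent to typing out the full list).
values <- c(425, 430, 435, 440, 445, 450, 460, 465, 470, 472, 475, 480,
            485, 490, 500, 510, 515, 525, 535, 549, 550, 570, 575, 580,
            590, 600, 615)
counts <- c(1, 2, 5, 5, 5, 7, 3, 3, 2, 1, 3, 4,
            1, 3, 4, 2, 1, 3, 1, 1, 1, 2, 2, 1,
            1, 4, 2)
rent <- rep(values, times = counts)

# Step 1: Sturges' formula for the number of classes
n <- length(rent)
k <- round(1 + 3.322 * log10(n))

# Step 2: class width, rounded up to a whole dollar
width <- ceiling((max(rent) - min(rent)) / k)

# Step 3: break points 425, 453, ..., 621
breaks <- seq(min(rent), by = width, length.out = k + 1)

# Step 4: assign each rent to a left-closed interval and tally
rent_class <- cut(rent, breaks = breaks, right = FALSE)
grouped <- table(rent_class)
print(grouped)
```

The resulting counts per class are 25, 16, 8, 7, 2, 6 and 6, which agree with the manually constructed table for this example.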
Rent (in $)    Frequency
425-452        25
453-480        16
481-508        8
509-536        7
537-564        2
565-592        6
593-620        6
Total          70

Table 5.8. Frequency table for monthly rents of 70 one-bedroom apartments.

To create the grouped frequency distribution table using R, we consider the following R script and make use of the rent.csv file in our data repository or working directory.

R Script

First we load the necessary packages and then import the “rent.csv” data into RStudio. We assign it to an object that we name “rent”. We can then view the imported data by using the View() function.

Next, we define the class intervals. We create class intervals with equal widths of 28. Notice from our manual solution that the upper limit of the highest class interval is 620. Because the intervals are defined in R by their break points, we adjust this value to 621 so that the last interval also has a width of 28.

We then assign each observation to its corresponding class interval. This can be achieved by means of the cut() function. This function divides the range of rent values (425 to 621) into equal intervals and codes each rent value according to the class interval in which it falls.

The ‘$’ symbol in the syntax is used to specify the variable of interest from the data. This symbol is especially useful when a data frame has a number of variables and we would like to single out one of them for our analysis.

The argument “right” is a logical argument that indicates whether the intervals should be closed on the right (and open on the left) or vice versa.
We set the value of this argument to “FALSE” so that our intervals are closed on the left and open on the right. We then perform the rest of the steps in creating a frequency distribution table, which are similar to the steps used in the preceding examples.

             Frequency
[425,453)    25
[453,481)    16
[481,509)    8
[509,537)    7
[537,565)    2
[565,593)    6
[593,621)    6

Table 5.9. Frequency table for monthly rents of 70 one-bedroom apartments.

In the output, a bracket on the left endpoint means that the value is included in the class interval, while a parenthesis on the right endpoint means that the value is not included in the interval. For example, [537, 565) indicates that the class interval has a lower limit of 537 and an upper limit of 564 (the value 565 is not included).

BAR GRAPH

A bar graph is a chart used to display qualitative data summarized in a frequency, relative frequency, or percent frequency distribution. For a vertical bar chart, the horizontal (x) axis represents the categories and the vertical (y) axis represents a value (frequency, relative frequency, or percent frequency) for those categories. In the graph below, the values are frequencies. The figure below shows the bar chart of the data on soft drink purchases of Example 1.

Fig. 5.3. Bar graph of data on soft drink purchases.
R Script

To construct the bar chart using RStudio, we use the ggplot function. Using the “purchase.csv” data, open a new R script file, then enter and run the following script. Note that if a package has already been installed in RStudio, we do not need to install it again whenever we perform an analysis. All we have to do is load the installed packages.

Fig. 5.4. RStudio output of bar graph of data on soft drink purchases.

We can also present the bar graph with the bars in order of decreasing frequency.

Fig. 5.5. RStudio output of bar graph of data on soft drink purchases with bars in decreasing frequencies.

Note that you need not assign the bar graphs to the objects bar1 and bar2. Removing these assignments from the script generates the bar charts right away. The charts are shown in the Plots window of RStudio, where you have the options “Save as Image”, “Save as PDF”, or “Copy to Clipboard” once you click on the “Export” icon.

PIE CHART

A pie chart (also called a pie graph or circle graph) provides another graphical device for presenting relative frequency and percent frequency distributions for qualitative data. The circle is subdivided into sectors, and the numerical value shown for each sector can be a frequency, a relative frequency, or a percent frequency.
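As a rough sketch of the two charts for the Example 1 data, the bar graph discussed earlier and the pie chart introduced here, the code below uses base R graphics (barplot() and pie()) instead of the ggplot function that the module uses for the bar graph. The frequencies are typed in from Table 5.1 rather than computed from “purchase.csv”, and the object names `freq`, `freq_sorted` and `pct` are illustrative.

```r
# Frequencies from Table 5.1 (assumption: in practice these would come from
# table() applied to the column read from "purchase.csv").
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Vertical bar chart: categories on the x-axis, frequencies on the y-axis
barplot(freq, xlab = "Soft Drink", ylab = "Frequency",
        main = "Soft Drink Purchases")

# Same chart with the bars in order of decreasing frequency
freq_sorted <- sort(freq, decreasing = TRUE)
barplot(freq_sorted, xlab = "Soft Drink", ylab = "Frequency",
        main = "Soft Drink Purchases (decreasing frequency)")

# Pie chart: label each sector with its percent frequency
pct <- 100 * freq / sum(freq)
pie(freq, labels = paste0(names(freq), " (", pct, "%)"),
    main = "Soft Drink Purchases")
```

Sorting the frequencies before plotting is one simple way to obtain bars in decreasing order. In ggplot the same effect is usually achieved by reordering the factor levels of the category variable.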
A pie chart makes use of sectors (slices) in a circle. The angle of a sector is proportional to the frequency of the category it represents. The formula to determine the angle of a sector in a circle graph is:

$$\text{angle of sector} = \frac{\text{frequency of the category}}{\text{total frequency}} \times 360^\circ$$

The figure below shows the pie chart of the data on soft drink purchases of Example 1 generated using Microsoft Excel.

Fig. 5.6. Pie chart of data on soft drink purchases.

R Script

Starting from the raw data, the following is the script for creating a simple pie chart in RStudio. We use the “purchase.csv” file for the same example.

Fig. 5.7. RStudio output of pie chart of data on soft drink purchases.

DOT PLOT

A dot plot is a graphical display of data using dots. It is similar to a bar graph because the height of each “bar” of dots is equal to the number of items in a particular category.
To draw a dot plot, count the number of data points falling in each category and draw a stack of that many dots for each category. A dot plot can be used as a graphical display of the frequency of qualitative and quantitative (ungrouped) data. The figure that follows shows the dot plot for the data of Example 2 on the number of cars registered to each household:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Number of cars    Frequency
0                 4
1                 6
2                 5
3                 3
4                 2

Table 5.10. Frequency distribution table for the number of cars registered in each household

R Script

Here we present two ways of constructing a dot plot. The first uses the ggplot function and a .csv data file imported from MS Excel. This is especially useful if we have a large data set. The other way uses the stripchart function. For this second method we also take a different approach to getting our data into RStudio: we create a data vector instead of using the imported .csv file. Creating a data vector is suitable when we are dealing with a small data set that is manageable to type in. The following are the scripts. For the first method, we use the “cars.csv” data from our directory.

Fig. 5.8. RStudio dot plot output of data on number of cars registered in each household.

Using the stripchart function, we have the following script.
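A minimal sketch of the data-vector approach with stripchart() is given below. It is an illustration under stated assumptions, not the module's original script: the vector name `n_cars` and the specific argument values (for example cex = 2) are chosen here for demonstration, and the `at` argument is left at its default.

```r
# Number of cars registered to each of the 20 households (Example 2)
n_cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Frequency table for the ungrouped data (compare with Table 5.10)
car_freq <- table(n_cars)
print(car_freq)

# Dot plot: repeated values are stacked on top of one another
stripchart(n_cars,
           method = "stack",     # stack coincident points
           pch = 20,             # plotting symbol 20 is a solid dot
           cex = 2,              # enlarge the dots
           las = 1,              # horizontal axis labels
           frame.plot = FALSE,   # no box around the plot
           xlim = c(0, 4),       # x-axis runs from 0 to 4 cars
           xlab = "Number of cars",
           main = "Cars Registered per Household")
```

The height of each stack corresponds to the frequency of that value, so the plot can be read against the frequency table printed just before it.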
Fig. 5.9. RStudio dot plot output, using the stripchart function, of data on number of cars registered in each household.

Notice the difference in dot sizes with different binwidths. You can further explore RStudio functionality by varying the values of the arguments in the syntax. For the stripchart function, the following are the descriptions of the arguments.

x: the data from which the plots are to be produced.
method: the method to be used to separate coincident points. The value of this argument was set to “stack” in order to have coincident points stacked.
at: a numeric vector giving the locations where the charts should be drawn.
pch: either an integer specifying a symbol or a single character to be used in plotting points (20 is for a dot).
cex: a numeric value that determines the amount by which plotting text and symbols should be magnified.
las: a numeric value that determines the style of axis labels (0 - always parallel to the axis (default); 1 - always horizontal; 2 - always perpendicular to the axis; 3 - always vertical).
frame.plot: a logical argument indicating whether a box should be drawn around the plot.
xlim: the x limits of the plot.
main: a main title for the plot.

We can always use the “Help” tab in RStudio to learn more about a particular function, particularly the arguments that we need to define for the function to work.
STEM-AND-LEAF PLOT

A stem-and-leaf plot is a graphical display for quantitative data that shows both the rank order and the shape of a data set. It is particularly useful when the data are not too numerous. Stem-and-leaf plots are a method for showing the frequency with which certain classes of values occur.

Example 1: The following illustration and steps are taken from the website: https://study.com/academy/lesson/how-to-make-a-stem-and-leaf-plot.html

The process will be easiest to follow with sample data, so let's pretend that a sports statistician wants to make a stem-and-leaf plot for a recent game played by the Blues basketball team. The total minutes played by each team member have been recorded and are shown below:

Blues Member Name    Minutes Played
Gifford              22
Slavky               29
Harrison             22
Samon                31
Mantry               20
Lewing               12
Wilson               14
Larriby              24
Paston               13
Lebling              4
Waster               2
Canno                1

Step 1: Determine the smallest and largest number in the data. Looking at the stats, we see the number of minutes played ranges from a low of 1 minute to a high of 31 minutes.

Step 2: Identify the stems. For any number, the digit or digits to the left of the right-most digit form the stem. For example, the number 31 has a stem of 3, while the number 29 has a stem of 2. A one-digit number like 4 has a stem of 0. Think ''04'' for 4. Based on the range of 1 to 31, we need stems of 0, 1, 2 and 3.
Step 3: Draw a vertical line and list the stem numbers to the left of the line.

0|
1|
2|
3|

Step 4: Fill in the leaves. The first data value is for Gifford, who played 22 minutes. The stem is on the left. The leaf is on the right.

0|
1|
2|2
3|

Let's enter Lebling's 4 minutes. The stem is 0 and the leaf is 4.

0|4
1|
2|2
3|

Entering the rest of the data:

0|4 2 1
1|2 4 3
2|2 9 2 0 4
3|1

Step 5: Sort the leaf data. The stem-and-leaf plot is easier to interpret when each row's leaves are sorted from low to high.

0|1 2 4
1|2 3 4
2|0 2 2 4 9
3|1

And that's the stem-and-leaf plot for minutes played.

The place value of the leaf is called the leaf unit. In the example above, the leaf unit is 1. Other leaf units may be 100, 10, 0.1, and so on. If the leaf unit is not 1, it should be displayed in the stem-and-leaf plot.

R Script

For the same example, the stem-and-leaf plot can be generated in RStudio by using the stem() function. The script is very short. Try this out in RStudio. Running the script prints the stem-and-leaf display in the Console.

Example 2: The stem-and-leaf plot for the data set

8.6 11.7 9.4 9.1 10.2 11.0 8.8

with leaf unit 0.1 is given by

 8|6 8
 9|1 4
10|2
11|0 7

This means that in reading the data from the stem-and-leaf plot, the stems are the digits in the units place and higher, while the leaves are the digits in the tenths place (first decimal place).

R Script
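The two stem-and-leaf examples can be sketched in R as follows. This is an illustrative sketch, not the module's own script: the vector names `minutes` and `x` and the helper objects `stems`, `leaves` and `rows` are assumptions made here. The first part mirrors Steps 2 to 5 by hand using integer division, and the stem() calls show the built-in display, whose exact layout is chosen by R.

```r
# Example 1: minutes played by the 12 Blues team members
minutes <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)

# Manual construction following Steps 2 to 5: the stem is the tens digit,
# the leaf is the units digit, and the leaves in each row are sorted.
stems  <- minutes %/% 10
leaves <- minutes %% 10
rows <- tapply(leaves, stems, function(l) paste(sort(l), collapse = " "))
cat(paste0(names(rows), "|", rows), sep = "\n")

# Built-in display: R chooses the stems and prints the plot to the Console
stem(minutes)

# Example 2: data with one decimal place (leaf unit 0.1)
x <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)
stem(x)
```

The manual part reproduces the sorted rows of Step 5 (0|1 2 4, 1|2 3 4, 2|0 2 2 4 9, 3|1), which gives a quick way to check a hand-drawn plot against the data.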
The corresponding output is printed in the Console.

Example 3: Let us now consider a data frame for this example. In MS Excel, open the data file “inflation.csv”. The data show the inflation rate (in %) of countries in Asia and the Pacific. Upon inspection of the variables, you will notice that there is only one quantitative variable, the inflation rate, labeled “Inflation”. We now create a stem-and-leaf display for this variable.

R Script

Running the script prints the stem-and-leaf plot for the data.

If the scale argument is given a value of 1, the stem-and-leaf plot becomes shorter. This in turn causes some of the observations to be “disregarded” by the plot, i.e., some values will not be presented in the output plot. The appropriate value of the scale argument depends on the quantity of data that we have. You can vary the value of the scale argument to see its effect on the stem-and-leaf plot generated.

HISTOGRAM

A histogram is a graphical portrayal of the frequency distribution of grouped data. It divides the data set into class intervals and gives the frequency for each class. Histograms are particularly useful for summarizing large sets of data.
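As a hedged sketch of how such a histogram might be drawn with the hist() function, the code below uses the monthly rent data of Example 3 from the grouped frequency distribution section. The vector `rent` is rebuilt from the listed values rather than read from “rent.csv”, and the break points (425 to 621 in steps of 28) follow the class intervals used for the grouped table. The object names are illustrative.

```r
# The 70 monthly rents of Example 3, as distinct values and their counts
values <- c(425, 430, 435, 440, 445, 450, 460, 465, 470, 472, 475, 480,
            485, 490, 500, 510, 515, 525, 535, 549, 550, 570, 575, 580,
            590, 600, 615)
counts <- c(1, 2, 5, 5, 5, 7, 3, 3, 2, 1, 3, 4,
            1, 3, 4, 2, 1, 3, 1, 1, 1, 2, 2, 1,
            1, 4, 2)
rent <- rep(values, times = counts)

# Histogram with seven left-closed classes of width 28
h <- hist(rent,
          breaks = seq(425, 621, by = 28),
          right = FALSE,
          xlab = "Monthly rent ($)",
          main = "Monthly Rent of 70 One-Bedroom Apartments")

# hist() also returns the class counts, which can be compared with the table
print(h$counts)
```

Because hist() returns the counts it plots, the bar heights can be checked directly against the grouped frequency table: 25, 16, 8, 7, 2, 6 and 6.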
The histogram corresponding to the frequency distribution table for the data on monthly rent ($) for a sample of 70 one-bedroom apartments in Example 3 is shown below:

Rent (in $)    Frequency
425-452        25
453-480        16
481-508        8
509-536        7
537-564        2
565-592        6
593-620        6
Total          70

Table 5.11. Frequency distribution table for monthly rents of 70 one-bedroom apartments

Fig. 5.10. MS Excel output chart for histogram of monthly rent data.

When you create a grouped frequency distribution table, it is advisable to create the corresponding histogram as well. To create the histogram in RStudio, we use the hist() function. We have the following script for the same example above.

R Script

To plot the histogram for the same example, again we use the “rent.csv” file.

Here we have the output histogram.

Fig. 5.11. RStudio output of histogram for data on monthly rent.

Summarizing Qualitative and Quantitative Data for Two Variables

Tabular and graphical displays for data obtained from two variables are helpful in understanding the relationship between them, if any. In this section we will discuss the crosstabulation or contingency table and the scatter diagram.

CROSSTABULATION

A crosstabulation or contingency table is a tabular summary of data for two variables. The variables can both be qualitative, both be quantitative, or be a combination of one qualitative and one quantitative variable. If either variable is quantitative, classes must be created for the values of the quantitative variable. The labels shown in the margins of the table define the categories (classes) for the two variables.
Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 80 Example: For an example, we consider the “salaries.csv” file which contains data on professors of a university, including rank, discipline being taught, years since PhD was obtained, years of service in the university, sex, and annual salary ($). We construct a crosstabulation of the rank and sex of the teachers. Using RStudio, we can generate the crosstabulation shown in Table 6. The following is the RStudio Script. Make sure that the summarytools package has already been installed before executing the following commands. R Script Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 81 The following are the outputs from RStudio. cross_table: Female Male Total AssocProf 10 54 64 AsstProf 11 56 67 Prof 18 248 266 Total 39 358 397 Table 5.12. Crosstabulation of rank and sex of teachers. proportions: Female Male Total AssocProf 0.1562 0.8438 1 AsstProf 0.1642 0.8358 1 Prof 0.06767 0.9323 1 Total 0.09824 0.9018 1 Table 5.13. Crosstabulation showing proportions of rank and sex of teachers From the crosstabulation, we can see that majority of the teachers have a rank of ‘Professor’. There are relatively more males than females among all the ranks, and teachers who are male professors make up the largest group. This could not have been easily observed by just looking at the raw data. 
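The counts in Table 5.12 and the row proportions in Table 5.13 can be reproduced with a minimal base R sketch. It relies on addmargins() and prop.table() rather than the summarytools package mentioned above, and it enters the counts by hand instead of reading “salaries.csv”. The object names counts, cross_table and row_props are assumptions chosen for illustration.

```r
# Counts entered by hand from Table 5.12 (rank by sex)
counts <- matrix(
  c(10,  54,
    11,  56,
    18, 248),
  nrow = 3, byrow = TRUE,
  dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                  sex  = c("Female", "Male"))
)
cross_table <- as.table(counts)

# Append row and column totals (labelled "Sum" by addmargins)
print(addmargins(cross_table))

# Row proportions: within each rank, the shares of females and males sum to 1
row_props <- prop.table(cross_table, margin = 1)
print(round(row_props, 4))
```

With the raw data frame loaded, calling table() on its rank and sex columns would produce the same counts; the exact column names depend on the file.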
SCATTER DIAGRAM/PLOT

A scatter diagram or scatter plot is a graphical display of the relationship between two quantitative variables. One variable (the independent variable) is shown on the horizontal axis and the other variable (the dependent variable) is shown on the vertical axis. The general pattern of the plotted points suggests the overall relationship between the variables. This relationship will be discussed further in Module 11 (Correlation and Regression).

Example: Consider the advertising/sales relationship for a stereo and sound equipment store. On 10 occasions during the past three months, the store used weekend television commercials to promote sales at its stores. The managers want to investigate whether a relationship exists between the number of commercials shown and the sales at the store during the following week. Sample data for the 10 weeks, with sales in hundreds of dollars, are shown in the table. The figure that follows is a scatter diagram for the data.

Week    Number of Commercials    Sales ($100s)
 1               2                    50
 2               5                    57
 3               1                    41
 4               3                    54
 5               4                    54
 6               1                    38
 7               5                    63
 8               3                    48
 9               4                    59
10               2                    46

Fig. 5.12. MS Excel output of scatter graph for Advertising vs Sales data.

R Script

Here we present two scripts for generating the scatter plot for the same problem. The example data are contained in the “advertising.csv” data file.

[script not shown]

The following is the RStudio output.

Fig. 5.13. RStudio output of scatter graph for Advertising vs Sales data.

We can also assign long variable names to simple object names in RStudio. Let us take a look at the following scripts.

[scripts not shown]

The same output as Fig. 5.13 results from the preceding scripts.

There are many other plots or graphs that can be generated, and the choice always depends on the nature of the data as well as the purpose for which the data were gathered. For more information about the use of R/RStudio in creating graphs or charts, a vast number of internet sites and instructional videos are available. What we have considered in this module are the basic or commonly used means of data presentation, so our learning should not be confined to what was presented here.

Learning Reinforcement Activity No. 5: DATA PRESENTATION

Use short bond paper with a 1-inch margin on all sides. Your answers can be handwritten or computerized.
If it is computerized, save it as a single PDF file; if it is handwritten, scan or take a photo of each page, copy and paste the photos into a Word document, and save it as a single PDF file (or you can use any means, such as a mobile app, to scan and save it as a single PDF file). The filename should follow this pattern: for example, BALLENAJaime_Reinforcement5 will be my output for Learning Reinforcement 5.

Use RStudio to construct the tabular and/or graphical displays required for each problem. For your output presentation for each problem, follow this format.

i. Problem: Present/copy the problem.
ii. R Script: Present the scripts that you have created.
iii. Output: Present the output graph/chart and/or table.
iv. Interpretation: Give a brief interpretation of the graph. Describe the data based on the graph/chart output.

1. The following table shows the data for 50 vehicle purchases made at a certain car dealership during the last 5 years.

Suzuki      Suzuki   Toyota      Suzuki   Toyota      Ford        Honda   Ford    Suzuki   Suzuki
Toyota      Suzuki   Ford        Ford     Toyota      Toyota      Ford    Suzuki  Suzuki   Ford
Mitsubishi  Honda    Mitsubishi  Ford     Mitsubishi  Ford        Ford    Honda   Suzuki   Ford
Honda       Toyota   Toyota      Suzuki   Suzuki      Mitsubishi  Ford    Ford    Honda    Mitsubishi
Honda       Ford     Toyota      Toyota   Honda       Suzuki      Suzuki  Toyota  Ford     Mitsubishi

Table 5.15. Data for 50 vehicle purchases

a. Construct a table showing the frequency, relative frequency, and percent frequency distribution for the given data. (10 points)
b. Construct a bar chart and a pie chart for the sample. (10 points)

2. The data below show the time in days required to complete year-end audits for a sample of 20 clients of Sanderson and Clifford, a small public accounting firm. Construct a dot plot for the sample. (5 points)

Year-end Audit Time (in days)
12  20  14  15  21  18  22  18  17  13
15  22  14  27  18  19  33  16  23  28

Table 5.16. Data on days required to complete year-end audits.

3. Dividend yield is the annual dividend paid by a company expressed as a percentage of the price of the stock (Dividend/Stock Price x 100). The dividend yields for the top dividend-paying stocks in the Philippines for 2021 are shown in Table 5.17.

a. Construct a table showing the grouped frequency distribution and percent frequency distribution. (10 points)
b. Construct a frequency histogram. (5 points)

Company                      Dividend Yield (%)    Company                       Dividend Yield (%)
Aboitiz Equity Ventures            2.39            Manila Electric                     4.85
Aboitiz Power Corp.                3.68            Megaworld                           1.33
Ayala Corp.                        0.95            Metro Pacific Investments           2.31
Ayala Land                         0.41            Metrobank                           2.28
BPI                                2.17            PLDT                                6.27
BDO                                1.14            Puregold                            1.13
Century Pacific Food               1.41            Robinsons Land                      3.06
D&L Industries                     1.78            Robinsons Retail Holdings           1.59
DMCI Holdings                      8.12            San Miguel Corp.                    1.32
Filinvest Land                     2.82            San Miguel Food & Beverage          1.97
First Gen Corp.                    2.00            Security Bank                       2.73
Globe Telecom                      5.82            SM Investments Corp.                0.46
GT Capital Holdings                0.56            SM Prime Holdings                   0.26
Int’l Container Terminal           1.49            Universal Robina Corp.              2.48
JG Summit Holdings                 0.67            Vista Land                          1.47
Jollibee Foods Corp.               0.41            Wilcon Depot                        0.53
LT Group                           4.94

Congratulations! You have just completed Module 5. You are getting acquainted with the R software. In the next module, we will start computing descriptive measures.
