GE MATH2 Module 5: Data Management PDF
Central Philippine University
Summary
This document is a module from a course titled 'Mathematics in the Modern World'. It focuses on data management, covering descriptive and inferential statistics, data types (nominal, ordinal, interval, ratio), key statistical concepts (population, variable, constant), and statistical tools. It aims to equip students with the knowledge to process and manage numerical data for analysis and prediction.
MATHEMATICS IN THE MODERN WORLD

MODULE FIVE: DATA MANAGEMENT

CORE IDEA
Statistical tools derived from mathematics are useful in processing and managing numerical data to describe a phenomenon and predict values.

Learning Outcomes:
5. Use a variety of statistical tools to process and manage numerical data.
6. Use the methods of linear regression and correlation to predict the value of a variable given certain conditions.
7. Advocate the use of statistical data in making important decisions.

Unit Lessons:
Lesson 5.1 The Data
Lesson 5.2 Measures of Central Tendency
Lesson 5.3 Measures of Dispersion
Lesson 5.4 Measures of Relative Position
Lesson 5.5 Normal Distributions
Lesson 5.6 Linear Correlation
Lesson 5.7 Linear Regression

Time Allotment: Ten lecture hours

Lesson 5.1 The Data

Specific Objectives
1. To understand the nature of statistics.
2. To gain deeper insights into the different levels of measurement.
3. To clarify the meaning of some important key concepts.
4. To explore the strengths and limitations of graphical representation.

It is written in the Holy Book that “the truth shall set us free”; therefore, understanding statistics paves the way towards intellectual freedom. For without sufficient knowledge about it, we may be doomed to a life of half-truth. Statistics will provide deeper insights to critically evaluate information and to bring us to the well-lit arena of practicality.

Discussions

General Fields of Statistics: Descriptive Statistics and Inferential Statistics

Descriptive Statistics. If statistics in general deals with the analysis of data, then descriptive statistics is the part of the field concerned with “describing” data in symbolic forms and abbreviated fashions.
Sometimes we deal with an amount of data so large that it is impossible to describe it as it is, but descriptive statistics provides tools that make the data manageable to handle and conveniently neat to describe.

To explore the characteristics of descriptive statistics, let us create a fictitious situation. What does it mean if someone tells you that the majority of workers earn approximately P20,000.00 a month? Were you able to dissect the idea behind the plain statement? Does it trigger your mind to question further? This statement is a piece of information that describes a particular trait or characteristic of a group of workers. Supplied with this single piece of information but armed with statistical inquisitiveness, we can use descriptive statistics to describe the given information further, to the extent of its depth and breadth.

Inferential Statistics. We could probably argue that descriptive statistics, with its capacity to describe, is sufficient to depict any given information. While it is effective in describing data of a manageable size, it can hardly cope with a sizeable amount of data. For this kind of situation, inferential statistics is the alternative technique. Inferential statistics has the ability to “infer” and to generalize, and it offers the right tools to predict values that are not actually known.

Let us consider the fictitious situation we made under descriptive statistics, but this time, instead of reporting the approximate monthly earnings of some workers, we want to estimate the monthly earnings of all the workers in a certain region. With descriptive statistics alone, we would have to ask every worker in the entire region about their monthly income, which is impossible. With inferential statistics, we would instead select just a small number of workers and ask them about their monthly income.
From there, we can predict or approximate, in a “more or less” fashion, the monthly income of all workers in the entire region.

Of course, inference or generalization is a risky process; that is why we need to ensure that the small group of workers we selected is reasonably representative of the workers in the entire region. Nevertheless, such an inference or prediction is more accurate than chance.

Measurement

Measurement essentially means quantifying an observation according to a certain rule. For instance, the presence of fever can be quantified by using a thermometer. Body weight can be determined by using a weighing scale. Mental ability can be quantified by using a written examination that generates scores. Sometimes the quantification is done simply by counting.

In quantifying an observation, there are two types of quantitative information: variable and constant. A variable is something that can be measured and observed to vary, while a constant is something that does not vary and maintains a single value.

Scales of Measurement
- Nominal Scale: Categorical Data
- Ordinal Scale: Ranked Data
- Interval/Ratio Scale: Measurement Data

To quantify an observation, it is necessary to identify its scale of measurement, also known as its level of measurement. The scale of measurement is the gateway to the fascinating world of statistics. Without sufficient knowledge of it, all our statistical learning leads nowhere.

Nominal Scale. It is concerned with categorical data. It simply means using numbers to label categories; the data are obtained by counting the frequency of occurrence within each category. One condition is that the categories must be independent, or mutually exclusive. This implies that once something is identified under a certain category, it cannot at the same time be assigned to another category.
For example, suppose we want to measure a group of people according to marital status. We can categorize marital status by simply assigning a number: for instance, “1” for single and “2” for married.

Marital Status: Single (1) and Married (2)

Obviously, those numbers only serve as labels and do not carry any numerical weight. Thus, we cannot say that married people (labelled 2) have more marital status than single people (labelled 1).

Ordinal Scale. It is concerned with ranked data. There are instances wherein comparison is necessary and cannot be avoided. The ordinal scale ranks the observations in order to generate information to the extent of “greater than” or “less than.” But ranked data are also limited to that extent: they cannot tell how much greater or how much less.

The ordinal scale is best illustrated by sports activities such as a fun run. Finding the order of finish among the participants always produces a ranking. However, ranked data cannot provide information about the difference in time between the 1st placer and the 2nd placer.

Relative to this, reading reports with ordinal information is also tricky. For example, a TV commercial extols a certain brand for being the number one product in the country. This may seem acceptable, but if you learn that there is no other product of its kind, then the message of the commercial will surely be received with a smirk.

Interval Scale. It deals with measurement data. In the nominal scale, we use numbers to label categories, while in the ordinal scale we use numbers merely to provide information regarding greater than or less than. In the interval scale, however, we assign numbers in such a way that there is meaning and weight in the value of points between intervals.
This scale of measurement provides more information about the data. Consider the comparative illustration below:

Academic performance of five students in a certain class

                 Student A   Student B   Student C   Student D   Student E
Interval Data    99          74          73          70          70
Ordinal Data     1st         2nd         3rd         4th         5th
Nominal Data     Passed      Failed      Failed      Failed      Failed

As you may have noticed, the interval scale provides substantial information about the grades of the students: Student A earned a grade of 99, and so on. Now look at the information given by the ordinal data. It is simply about ranking. With this kind of information, Student B can proudly and rightfully claim 2nd place in the ranking. The ordinal scale is a trusted friend that keeps a secret: the grade of Student B, though in 2nd place, is actually only 74.

Let us analyze the nominal data in our example. With this scale, the school can only announce, sadly, that one student passed and four students failed. Nominal data cannot provide more information; in particular, they cannot put Student A in a brighter limelight. The audience may assume that Student A earned a grade just a little higher than the passing mark, and Student A's grade of 99 will remain hidden forever.

Ratio Scale. This is an extension of the interval scale. It also pertains to measurement data, but the ratio scale is concerned with absolute values, that is, with a true zero point. Because of this, we often cannot use the ratio scale in the social sciences. We cannot justify an absolute value to gauge intelligence. We cannot say that Student A, with a grade of 99, has an intelligence several points superior to that of Student E, who barely achieved a grade of 70.

Key Concepts in Statistics

Population. A population can be defined as an entire group of people, things, or events having at least one trait in common (Sprinthall, 1994). A common trait is the binding factor that allows us to group a cluster and call it a population.
Merely having a cluster of people, things, or events is not enough to make a population. At least one common trait must be established. On the other hand, adding too many common traits can limit the size of the population. In the illustration below, notice how each added trait can severely reduce the membership of the population.

- A group of students (this is a population, since the common trait is “students”)
- A group of male students
- A group of male students attending the Statistics class
- A group of male students attending the Statistics class with an iPhone
- A group of male students attending the Statistics class with an iPhone and an earphone

As we read the list, we can visualize the population becoming dramatically smaller, and as we add more traits we may wonder if anyone still qualifies. The more common traits we add, the more we reduce the designated population.

Parameter. In gauging the entire population, any measure obtained is called a parameter. Situationally, if someone asks you what the parameter of the study is, bear in mind that they are referring to the size of the entire population. In situations where the actual size of the population is difficult to obtain, the parameters take the form of estimates or inferences.

Sample. The small number of observations taken from the total number making up a population is called a sample. As long as the observations or data are not the totality of the entire population, they are always considered a sample. For instance, in a population of 100, 1 is a sample, and 30 is clearly a sample. It may seem absurd, but 99 taken from 100 is still considered a sample. Not until we include that last one (making it 100) can we claim that it is a population and no longer a sample.

Statistic. In gauging the sample, any measure obtained from the sample is called a statistic.
Whenever we describe the sample, the measures we use are called statistics. Since a sample is easier to observe or gather than the population, statistics are simpler to gather than parameters.

Graphical Representation

Graphs. A graph is another way to visually show the behavior of data. To create a graph, the distribution of scores must first be organized. For instance, presenting the scores below in an unorganized manner provides confusing information or none at all; reporting raw scores can even keep some significant scores from being noticed.

120, 65, 110, 75, 105, 80, 105, 85, 100, 85, 100, 90, 95, 90, 90

But when we arrange the scores from highest to lowest, which is a form of score distribution, some pieces of information are gradually brought forth and exposed.

Distribution of Scores
120, 110, 105, 105, 100, 100, 95, 90, 90, 90, 85, 85, 80, 75, 65

The score distribution can be organized further in the form of a frequency distribution. A frequency distribution provides information about the raw scores and their frequency of occurrence, and gives clearer insights about the behavior of the scores.

X (Raw score)    f (Frequency of Occurrence)
--------------------------------------------
120              1
110              1
105              2
100              2
95               1
90               3
85               2
80               1
75               1
65               1
--------------------------------------------

Another way of presenting the data in a frequency distribution is in graphical form. A graph has the advantage of giving a visual representation of the data. This kind of presentation is more appealing to a general audience.

[Figure: graph of the frequency distribution. Horizontal axis: raw scores, 60 to 130. Vertical axis: frequency of occurrence, 0 to 3.]

Another way of showing the data in graphical form is by using Microsoft Excel, as illustrated in the graphs below. It is the frequency polygon of the scores in our example above.
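The frequency distribution tabulated earlier in this lesson can also be produced with a few lines of code. The following is a minimal sketch in Python, which is an assumption of this example (the module itself works with Microsoft Excel); the names `scores` and `frequency_distribution` are illustrative only. It counts how often each raw score occurs and lists the scores from highest to lowest.

```python
from collections import Counter

# The fifteen raw scores from the example, in their original unorganized order.
scores = [120, 65, 110, 75, 105, 80, 105, 85, 100, 85, 100, 90, 95, 90, 90]


def frequency_distribution(values):
    """Return (raw score, frequency) pairs ordered from highest to lowest score."""
    counts = Counter(values)          # maps each distinct score to its frequency
    return sorted(counts.items(), reverse=True)


table = frequency_distribution(scores)

# Print the table with the same two columns used in the lesson: X and f.
print(f"{'X':>5} {'f':>3}")
for score, freq in table:
    print(f"{score:>5} {freq:>3}")
```

The frequencies in the second column should add up to the number of raw scores, which is a quick check that no score was dropped while organizing the data.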
Notice that, in the illustration of the frequency polygon, the two graphs may appear different, but they are actually the same and disclose the same information. This illustration should make you realize that unless you look at things with a critical eye, a graph can create a false impression of what the data really reveal. It is a clear demonstration of how graphs can be used to distort reality if you are not equipped with a critical statistical mind. This type of deceitful cleverness in distorting graphs is common among some corporations, which use it as tinsel to camouflage reality or to portray gigantic leaps in sales in order to attract more clients or buyers.

Learning Activity 5.1

Indicate which scale of measurement (nominal, ordinal, or interval) is being used.
1. The Globe and Smart phone number prefixes 0917 and 0923 serve 1 million and 2.5 million subscribers, respectively.
2. The Philippine Statistics Office announces that the average height of Filipino males is 156.41 cm.
3. The Postal Office shows that 4,231 individuals have a zip code of 4231.
4. The Sportsfest committee posted the names of individuals with their order of finish for the first 50 runners to reach the finish line.
5. The University Admission Office posted the names and scores of student applicants who took the entrance examination.

Lesson 5.2 Measures of Central Tendency

Specific Objectives
1. To know the different measures of central tendency.
2. To comprehend the limitations of the three measures.
3. To realize the effect of the measures in the distribution.
4. To critically know how to select the appropriate measure to describe a certain distribution.

Discussion

As we venture into the realm of descriptive statistics, let us now focus on describing the nature of quantitative data. By using an appropriate descriptive technique, we can organize and neatly summarize both small and large data distributions.
The procedure, which uses measures of central tendency, allows us to describe precisely the centrality of a data distribution. Measures of central tendency are methods that can be used to determine information regarding the average, ranking, and category of any data distribution. The mean, median, and mode are the three tools for obtaining measures of central tendency. But only by knowing and using the appropriate tool can the most accurate estimate of centrality be achieved.

The objective of the measures of central tendency is to describe the centrality of the distribution in a single numerical unit. This single numerical unit must provide a clear description of the common trait being observed in the distribution of scores.

The Mean

The most widely used measure of central tendency is the mean ($\bar{x}$). It is the arithmetic average of all the scores. The mean is determined by adding all the scores together and then dividing by the total number of scores. The basic formula for the mean is as follows:

$$\bar{x} = \frac{\sum x}{N}$$

where
$\bar{x}$ is the mean,
$\sum$ is the operational term “summation,” indicating to add all measures of $x$,
$x$ stands for the raw scores, and
$N$ is the entire number of observations being dealt with.

In the example below concerning the annual income of 12 workers, the mean is found by calculating the average score of the distribution.

X
===========================
Php 200,000.00
    200,000.00
    195,000.00
    194,000.00
    194,000.00
    194,000.00
    193,000.00
    190,000.00
    185,000.00
    180,000.00
    180,000.00
    176,000.00
===========================
$\sum x$ = Php 2,281,000.00

$$\bar{x} = \frac{\sum x}{N} = \frac{2{,}281{,}000.00}{12} \approx \text{Php } 190{,}083.33$$

In this example, the mean is an appropriate measure of central tendency because the distribution is fairly well balanced. This means that there are no extremely high or extremely low scores in either direction that can unusually influence the average of the scores. Thus, the mean value of about Php 190,083 represents the total picture of the distribution (i.e.,
annual incomes). This means that, in a “more or less” or approximate fashion, it describes the entire distribution.

Mean of Skewed Distribution. There are situations wherein the mean cannot be trusted to provide a measure of central tendency because it portrays an extremely distorted picture of the average value of a distribution of scores. For instance, let us still consider our example of annual incomes, but this time with some adjustment. Let us introduce another score: the annual income of an affluent new neighbor who moved to this town just recently. This new neighbor has an annual income extremely far above the others.

X
===========================
Php 2,500,000.00   New neighbor
      200,000.00
      200,000.00
      195,000.00
      194,000.00
      194,000.00
      194,000.00
      193,000.00
      190,000.00
      185,000.00
      180,000.00
      180,000.00
      176,000.00
===========================
$\sum x$ = Php 4,781,000.00

$$\bar{x} = \frac{\sum x}{N} = \frac{4{,}781{,}000.00}{13} \approx \text{Php } 367{,}769.00$$

As you may have noticed, the mean income of Php 367,769.00 this time provides a highly misleading picture of great prosperity for this neighborhood. The distribution was unbalanced by the extreme score of the new affluent neighbor. This is what we call a skewed distribution.

In a graph of a skewed distribution, when the tail goes to the right, the curve is positively skewed; when it goes to the left, it is negatively skewed. The skew is in the direction of the tail-off of scores, not of the majority of scores. The mean is always pulled toward the extreme score in a skewed distribution. When the extreme score is at the low end, the mean is too low to reflect centrality. When the extreme score is at the high end, the mean is too high.

The Median

The median is the point that separates the upper half from the lower half of the distribution. It is the middle point or midpoint of any distribution.
If the distribution is made up of an even number of scores, the median is found by determining the point that lies halfway between the two middlemost scores. For example, for the four scores below, the two middlemost scores are 190,000.00 and 185,000.00:

193,000.00
190,000.00
185,000.00
180,000.00

$$\text{Median} = \frac{190{,}000 + 185{,}000}{2} = 187{,}500$$

Arranging scores to form a distribution means listing them sequentially, either highest to lowest or lowest to highest. Unlike the mean, the median is not affected by a skewed distribution. Whenever the mean cannot represent centrality because extreme scores are present, the median can be used to provide a more accurate representation.

Calculation of the Median

X
===========================
Php 2,500,000.00   (extreme score)
      200,000.00
      200,000.00
      195,000.00
      194,000.00
      194,000.00
      194,000.00   Median
      193,000.00
      190,000.00
      185,000.00
      180,000.00
      180,000.00
      176,000.00
===========================

As you can observe, even with the presence of an extreme score at the high end of the distribution, the value of the median is still undisturbed.

The Mode

Another measure of central tendency is called the mode. It is the most frequently occurring score in a distribution. In a histogram, the mode is always located beneath the tallest bar.

Finding the mode of a distribution of raw scores (Annual Income)

X
===========================
Php 2,500,000.00
      200,000.00
      200,000.00
      195,000.00
      194,000.00
      194,000.00   Mode
      194,000.00
      193,000.00
      190,000.00
      185,000.00
      180,000.00
      180,000.00
      176,000.00
===========================

The mode provides an extremely fast way of knowing the centrality of the distribution. You can immediately spot the mode by simply looking at the data and finding the dominant constant, that is, the most frequently occurring score.

Appropriate Use of the Mean, Median and Mode

The best way to illustrate the comparative applicability of the mean, median, and mode is to look again at the skewed distribution.
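Before turning to a skewed income curve, the three measures can be checked numerically on the annual-income example used in this lesson. The sketch below is written in Python using its standard `statistics` module; the choice of Python and the variable names (`incomes`, `with_neighbor`, `summarize`) are assumptions made for illustration, not part of the module. It computes the mean, median, and mode for the twelve original incomes and again after the new neighbor's income of Php 2,500,000 is added.

```python
from statistics import mean, median, mode

# Annual incomes of the 12 workers (in pesos), as listed in the lesson.
incomes = [200_000, 200_000, 195_000, 194_000, 194_000, 194_000,
           193_000, 190_000, 185_000, 180_000, 180_000, 176_000]

# The same distribution after the affluent new neighbor moves in.
with_neighbor = [2_500_000] + incomes


def summarize(values):
    """Return the three measures of central tendency for a list of scores."""
    return {
        "mean": mean(values),      # arithmetic average: sum of scores / N
        "median": median(values),  # midpoint; averages the two middle scores if N is even
        "mode": mode(values),      # most frequently occurring score
    }


balanced = summarize(incomes)
skewed = summarize(with_neighbor)

for label, result in (("12 workers", balanced), ("with new neighbor", skewed)):
    print(label)
    for name, value in result.items():
        print(f"  {name:<6} = {value:,.2f}")
```

Comparing the two summaries shows the pattern described above: the extreme score pulls the mean from about Php 190,083 to about Php 367,769, while the median only shifts between neighboring scores and the mode stays at Php 194,000.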
[Figure: Distribution of monthly income per household in a certain municipality. Vertical axis: frequency of occurrence. Marked on the curve: mode at P10,000, median at P20,000, mean at P100,000.]

Income distributions are typically skewed to the right because the low end has a fixed limit of zero while the high end has no limit. If we consider the area under the curve to be 100 percent, then the median is the exact midpoint of the distribution. The areas below and above the median are each equal to 50 percent. Thus, if the median income is P20,000.00, this means that 50% of the households have an income below P20,000.00 and 50% of the households have an income above P20,000.00.

On the other hand, the mean in the figure indicates a high income of P100,000. This reflects the positive skew of the curve. The value of the mean gives a distorted picture of reality. It is unduly influenced by the few affluent income earners at the high end of the curve, whose monthly income is around P500,000.00.

The modal income, which is P10,000 per month, also seems to distort reality, this time towards the low side. The mode is always the highest point of the curve. In this example, the mode represents the most frequently earned income; it is far lower than the median income of P20,000.00. Both the mean and the mode give a false portrait of what is typical in the distribution, and the truth lies somewhere in between.

Effects of the Scale of Measurement Used

The scale of measurement on which the data are based oftentimes dictates the measure of central tendency to be used. Interval data allow the calculation of all three measures of central tendency. Nominal and ordinal data cannot be used to calculate the mean; a mean computed from ordinal data can give an extremely confusing and wrong result. Since the median is about ranking, that is, about the point above which and below which equal numbers of scores fall, an ordinal arrangement is necessary for finding the median.
For nominal data, however, neither the mean nor the median can be used. Nominal data merely use a number as a label for a category, and the only measure of central tendency permissible for nominal data is the mode.

In summary, if the interval data distribution is fairly well balanced, it is appropriate to use the mean to measure central tendency. If the distribution of the interval data is skewed, you may either remove the outlier or adopt the median. If the interval data distribution shows a significant clustering of scores, then visually analyze the scores and look for the dominant constant, which is the mode.

Learning Activity 5.2

1. A class of 13 students takes a 20-item quiz in Science 101. Their scores were as follows: 11, 11, 13, 14, 15, 18, 19, 9, 6, 4, 1, 2, 2.
   a. Find the mean.
   b. Find the median.
   c. Find the mode.
2. A day after, the same 13 students mentioned in problem 1 take the same test a second time. This time their scores were: 10, 10, 10, 10, 11, 13, 19, 9, 9, 8, 1, 7, 8.
   a. Find the mean.
   b. Find the median.
   c. Find the mode.
   d. Was there a difference in their performance when taking the test a second time?
3. For the set of scores: 1000, 50, 120, 170, 120, 90, 30, 120.
   a. Find the mean.
   b. Find the median.
   c. Find the mode.
   d. Which measure of central tendency is the most appropriate, and why?

Lesson 5.3 Measures of Dispersion

Specific Objectives
1. To know the different measures of variability.
2. To comprehend the strengths of the three measures.
3. To realize the effect of the measures in the distribution.
4. To critically select an appropriate tool for a certain situation.

The measures of central tendency only provide information about the similarity or typicality of scores. But to fully describe the distribution, we need to gain information about how scores differ or vary.
The description of the distribution can only be complete if some information about its variability is known. To substantiate the information provided by the measures of centrality, some degree of dispersion must also be brought to light.

Discussion

Measures of Variability

There are three measures of variability: the range, the standard deviation, and the variance. These three measures give information about the spread of the scores in a distribution. Metaphorically, variability asserts that a glass half full is also half empty: being half full is about centrality, and being half empty is about variability.

The Range. The range, symbolized by R, describes the variability of scores by merely providing the width of the entire distribution. The range is found by determining the difference between the highest score (HS) and the lowest score (LS). This difference is always a single value. The example below shows the calculation of the range from a distribution of annual incomes:

X
===========================
Php 200,000.00   Highest Score
    200,000.00
    195,000.00
    194,000.00
    194,000.00
    194,000.00
    193,000.00
    190,000.00
    185,000.00
    180,000.00
    180,000.00
    176,000.00   Lowest Score
===========================
Range = HS - LS = 200,000 - 176,000 = 24,000

The strength of the range is that it gives information about the scattering of the scores using just the two extreme points. On the other hand, this poses a severe limitation on its ability to report score deviation. If you add new scores within the distribution, the range will never report any change in the deviation. Also, adding just one extreme score to an otherwise ordinary distribution will definitely widen the range even if no other deviations have occurred within the distribution. The range is not stable enough to indicate variability, but it is nevertheless still a method for finding the variability of a given distribution.

The Standard Deviation.
The standard deviation (SD) is the lifeblood of the variability concept. It measures how much all of the scores in the distribution typically differ from the mean of the distribution. Unlike the range, which uses only the two extreme scores, the SD employs every score in the distribution. It is computed with reference to the mean (not the median or the mode), and it requires that the scores be in interval form.

A distribution with a small standard deviation shows that the trait being measured is homogeneous, while a distribution with a large standard deviation indicates that the trait being measured is heterogeneous. A distribution with zero standard deviation implies that the scores are all the same (e.g., 10, 10, 10, 10, 10). Although it may seem like stating the obvious, it is important to note that if all the scores are the same, there is no dispersion, no deviation, and no scattering of scores in the distribution; variability can therefore never be less than zero.

In calculating the standard deviation, we can use either the computational method or the deviation method. Both methods give the same answer. In this lesson, we will use the computational method because it is designed for electronic calculators. The formula for the computational method is:

$$SD = \sqrt{\frac{\sum X^2}{N} - \bar{X}^2}$$

where
$X$ is a raw score in the distribution,
$\bar{X}$ is the mean of the distribution, and
$N$ is the number of scores in the distribution.

The formula simply states that the standard deviation (SD) is equal to the square root of the difference between the sum of the squared raw scores divided by the number of cases, and the mean squared (Sprinthall, 1994). Below is an example of how to obtain the standard deviation using the computational method, applied to the twelve annual incomes from Lesson 5.2, for which $\sum X^2 = 434{,}283{,}000{,}000$ and the unrounded mean is $\bar{X} = 2{,}281{,}000 / 12 \approx 190{,}083.33$:

$$SD = \sqrt{\frac{434{,}283{,}000{,}000}{12} - \left(\frac{2{,}281{,}000}{12}\right)^2} \approx 7653.521$$

Computer Note: MS Excel can also be used to find the value of the SD.

The concept of standard deviation can be clarified further by an illustration of the score distributions of students in Section A and in Section B, assuming that both distributions have precisely the same measures of central tendency and the same range. The only unusual thing about these two distributions is that they differ in their standard deviations, Section A having a greater value than Section B. The data are shown below.

Math Quiz Scores

                 Section A      Section B
Mean             100            100
Median           100            100
Mode             100            100
N                30             30
HS, LS, Range    130, 70, 60    130, 70, 60
SD               10             2

[Figure: Two Frequency Distributions of Scores. Panels: Section A and Section B. Horizontal axis: scores (70, 100, 130). Vertical axis: frequency of occurrence.]

As can be noticed in the figure, there is just a slight bulge in the middle of the distribution of Section A. This means that it has many scores deviating widely from the mean (100), which is the result of having a large standard deviation (10). In Section B, which has a smaller standard deviation (2), most of the scores gather closely around the mean (100), creating a towering lump. Comparing these two distributions reveals the disparity in the values of the standard deviation between the two sections. Section A, having a large standard deviation, behaves in a heterogeneous manner, while Section B, having a small standard deviation, behaves in a homogeneous way.

The Variance. Variance is another technique for assessing disparity in a distribution. In the simplest sense, variance is the square of the standard deviation. The formula is:

$$V = SD^2 = \frac{\sum X^2}{N} - \bar{X}^2 = \frac{\sum x^2}{N}$$

where $X$ is any raw score in the distribution and $x$ is the deviation score, equal to the raw score minus the mean: $x = X - \bar{X}$.

Conceptually, variance is the same as standard deviation. If both the standard deviation and the variance have large values, the distribution is heterogeneous; when both have small values, they likewise agree that the distribution is homogeneous. The variance is the average of the squared deviations of the scores from the mean, while the standard deviation, as the square root of the variance, expresses how far the scores are spread out from the mean in the original units of the scores.

It may appear unnecessary to study variance when the standard deviation seems complete. But there are situations wherein it is more efficient to work directly with variances than to keep returning to the standard deviation. In fact, the F ratio makes full use of this special property of variability.

Learning Activity 5.3

1. At ABC University, a group of students was selected and asked how much of their weekly allowance they spent on mobile phone load. The following is the list of amounts spent: Php 120, 110, 100, 200, 10, 90, 100, 100. Calculate the mean, the range, and the standard deviation.
2. At XYZ University, another group of students was selected and asked how much of their weekly allowance they spent on mobile phone load. The following is the list of amounts spent: Php 200, 180, 30, 20, 10, 160, 150, 80. Calculate the mean, the range, and the standard deviation.
3. Consider the data in problems 1 and 2. In what way do the two distributions differ? Which group is more homogeneous?

Lesson 5.4 Measures of Relative Position

Specific Objectives
1. To gain a deeper understanding of the z-score
2.
To realize the important role of percentiles and quartiles in a distribution
3. To interpret the analysis reported by box-and-whisker plots

In the previous lessons, we demonstrated two separate but related kinds of measures that describe the scores in a distribution: the measures of central tendency and the measures of variability. In this lesson, we explore the possibilities that can occur in the relationship between centrality and variability (i.e., mean and standard deviation). Let us consider two distributions and the different scenarios that can arise when comparing their means and standard deviations.

Discussion

The z-Score

Case A: $\mu_1 = \mu_2$, $\sigma_1 = \sigma_2$
As shown in Case A, two distributions can have the same (or almost the same) means ($\mu$) and the same (or almost the same) standard deviations ($\sigma$).

Case B: $\mu_1 \neq \mu_2$, $\sigma_1 = \sigma_2$
It is also possible for two distributions to have different means ($\mu$) but similar standard deviations ($\sigma$).

Case C: $\mu_1 = \mu_2$, $\sigma_1 \neq \sigma_2$
In Case C, the two distributions have the same mean ($\mu$) but differ in standard deviation ($\sigma$).

Case D: $\mu_1 \neq \mu_2$, $\sigma_1 \neq \sigma_2$
In Case D, the distributions differ in both mean ($\mu$) and standard deviation ($\sigma$).

This preliminary discussion shows that comparing two distributions is complex, and each scenario must be considered. Sometimes two distributions differ in their means and sometimes they differ in their standard deviations. Groups usually differ in centrality as well as in disparity. Thus, in order to compare two different groups, there must be a common scale that reconciles both the mean and the standard deviation in a single standard form. Direct comparison is possible only when we convert scores from different distributions to common scores. This common score is called the z-score.
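Before looking at the formula in detail, the idea of a common scale can be sketched in a few lines of Python. This is only an illustration, assuming the usual definition of a z-score (raw score minus mean, divided by standard deviation). The function name `z_score` and the sample numbers are hypothetical and are not taken from the module.

```python
def z_score(raw, mean, sd):
    """Return how many standard deviations `raw` lies above or below `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd


# Hypothetical example: the same raw score of 60 in two different groups.
z_a = z_score(raw=60, mean=50, sd=10)  # group A: wide spread
z_b = z_score(raw=60, mean=54, sd=3)   # group B: narrow spread

print(z_a, z_b)  # 1.0 2.0
```

In this sketch the same raw score sits one standard deviation above the mean in group A but two standard deviations above the mean in group B, which is the kind of comparison raw scores alone cannot show.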
Below is the formula for finding the z-score.

Population: $$z = \frac{X - \mu}{\sigma}$$

Sample: $$z = \frac{X - \bar{x}}{s}$$

Population formula:
$X$ refers to a raw score from the population
$\mu$ is the mean of the population
$\sigma$ is the population standard deviation

Sample formula:
$X$ refers to a raw score from the sample
$\bar{x}$ is the mean of the sample
$s$ is the estimated standard deviation

Both formulas express the same relationship among the raw score, the mean, and the standard deviation. The only distinction between them is whether the distribution comes from the population or from a sample. The first formula gives z-scores for a population, while the second gives z-scores for a sample.

The formula shows that the mean and standard deviation can be combined to transform a raw score ($X$) into a standard score ($z$). The z-score equation, $z = \frac{X - \mu}{\sigma}$, converts the raw score of any group into a common value, which enables comparison between scores coming from different group distributions. Below is an illustration of a standardized scale. On this z-scale, the mean is always zero and the standard deviation is always one unit.

[Figure: standardized z-scale with $\mu = 0$ and $\sigma = 1$.]

To further clarify the concept of the z-score, assume that you are taking physics and biology courses. In your final examinations, you earned a grade of 95 in physics and 85 in biology. Now the question is: in which exam did you do better? Based on the face value of the scores, it seems obvious that you did better in physics than in biology. But for a serious comparison of your scores on the two tests, we must consider how well your classmates performed as a whole group. This requires additional information: the mean and standard deviation of both the physics and biology groups. Let us assume that we can obtain that information right away.
As such:

            $\mu$ (population mean)    $\sigma$ (population SD)
Physics     85                         10
Biology     75                         5

Now, let us substitute that information into the z-score formula and compute the z-score values.

Physics: $$Z_p = \frac{X_p - \mu_p}{\sigma_p} = \frac{95 - 85}{10} = 1.0$$

Biology: $$Z_b = \frac{X_b - \mu_b}{\sigma_b} = \frac{85 - 75}{5} = 2.0$$

Finally, let us place these z-score values on a z-scale to illustrate the measures clearly.

                              Zp = 1.0  Zb = 2.0
Z          -4   -3   -2   -1    0   +1   +2   +3   +4
Physics    45   55   65   75   85   95  105  115  125
Biology    55   60   65   70   75   80   85   90   95

Notice that in the illustration, we can compare the relative positions of the scores on one standardized scale. Notice also that the means of both subjects are reconciled to a common mean of 0 ($\mu = 0$). Likewise, the standard deviations of both subjects are calibrated to a unit of one ($\sigma = 1$). Thus, your final examination scores can now be compared. As displayed, your score of 95 in physics lines up with 1.0 on the z-scale, and your score of 85 in biology lines up with 2.0 on the z-scale. It is clear that you did much better in the biology exam ($Z_b = 2.0$), contrary to our earlier impression that you did better in physics. This example is only a glimpse of how standardized scores serve as building blocks for inferential statistics.

Percentile

Percentiles, quartiles, and deciles are the tools used to locate a specific point in a distribution. The relative position of a raw score can be described precisely by converting it into a percentile. A percentile is a point in the distribution below which a given percentage of scores fall.

[Figure: distribution curve marking the 3rd percentile and the 97th percentile.]

Based on the figure above, a score at the 97th percentile (P97) is at the very high end of the distribution because a very large share (97%) of the scores are below that point.
A score at the 3rd percentile (P3), however, is an extremely low score because only 3% of the scores are below that point. The figure above also shows that the 50th percentile divides the distribution exactly in half. The position of the 50th percentile is also the location of the median.

To better understand the role of the percentile, assume that your College Admission Test result placed you at the 97th percentile. This does not mean that out of 100 test items you made around three mistakes. Instead, it means that you scored higher than 97% of those who took the exam. Only about 3% scored as high as or higher than you.

The percentile of any given data value ($x$) can be determined by dividing the number of data values less than $x$ by the total number of data values, and then multiplying the result by 100. For instance, consider a College Admission Test administered to 5000 students, where your score of 800 was higher than the scores of 4000 examinees. With this information, we can determine the percentile of your score by using the formula:

$$\text{Percentile of } x = \frac{\text{number of examinees who did worse than you}}{\text{total number of examinees}} \times 100 = \frac{4000}{5000} \times 100 = 80$$

Your score of 800 places you at the 80th percentile.

Quartiles. As the name implies, quartiles divide the distribution into quarters.

[Figure: distribution curve marking the 3rd percentile, Q1, Q2, Q3, and the 97th percentile.]

The first quartile, Q1, is at the 25th percentile. The second quartile, Q2, coincides with the median, which is at the 50th percentile. The third quartile, Q3, is at the 75th percentile. The quartiles can be determined by using the following procedure:

For Q1: the value of x is in position 0.25(n + 1)
For Q2: the value of x is in position 0.50(n + 1)
For Q3: the value of x is in position 0.75(n + 1)

Let us consider this example and determine Q1, Q2, and Q3.
X
===========================
Php 200,000.00
    200,000.00
    195,000.00
    194,000.00
    193,000.00
    192,000.00
    191,000.00
    190,000.00
    185,000.00
    181,000.00
    180,000.00
    176,000.00
===========================

First, make sure that the scores are arranged from highest to lowest. Positions are then counted from the lowest score upward, so position 1 is 176,000 and position 12 is 200,000.

1. Calculating the 1st quartile (Q1) or the 25th percentile

Position of Q1 = 0.25(n + 1) = 0.25(12 + 1) = 3.25

Position 3.25 lies one quarter of the way from the 3rd score (181,000) to the 4th score (185,000). The value of x corresponding to this position is 181,000 + 0.25(185,000 - 181,000). Thus, Q1 = 182,000.

2. Calculating the 2nd quartile (Q2) or the 50th percentile

Position of Q2 = 0.50(n + 1) = 0.50(12 + 1) = 6.5

Position 6.5 lies halfway between the 6th score (191,000) and the 7th score (192,000). The value of x corresponding to this position is 191,000 + 0.50(192,000 - 191,000). Thus, Q2 = 191,500.

3. Calculating the 3rd quartile (Q3) or the 75th percentile

Position of Q3 = 0.75(n + 1) = 0.75(12 + 1) = 9.75

Position 9.75 lies three quarters of the way from the 9th score (194,000) to the 10th score (195,000). The value of x corresponding to this position is 194,000 + 0.75(195,000 - 194,000). Thus, Q3 = 194,750.

Box-and-Whisker Plots

A box-and-whisker plot displays a graphical summary of a set of data. It provides information about the minimum and maximum scores in the distribution, the 1st quartile and 3rd quartile, and the 2nd quartile or median.

[Figure: a generic box-and-whisker plot.]

Now, let us find the five-point summary of our previous example.

X (Php)
================================
200,000.00   HS (highest score)
200,000.00
195,000.00
             Q3 = 194,750
194,000.00
193,000.00
192,000.00
             Median (Q2) = 191,500
191,000.00
190,000.00
185,000.00
             Q1 = 182,000
181,000.00
180,000.00
176,000.00   LS (lowest score)
================================

Box-and-whisker plots are easy to construct, and they show important information about the distribution of scores in a simple diagram. It is also not necessary to label the final product.

[Figure: box-and-whisker plot of the example data drawn over a number line.]

Learning Activity 5.4

1. You have taken final exams.
Your score in Science 101 was 80. Your score in Math 101 was 95.

                n       Σx      Σx²
Science 101     120     7120    2800
Math 101        75      2275    325

a. Compute the means of both classes.
b. Compute the standard deviations of both classes.
c. Convert your final scores into z-scores.
d. Plot the standard scores on a z-scale, and include the appropriate raw score scale values for the two classes.
e. In which class did you do better? Explain how you analyzed it.

2. The scores of all students at ABC school were obtained. The highest score was 140, and the lowest score was 110. The following scores were identified by their percentile:

__________________________
X        Percentile
--------------------------
112      10th
119      25th
123      50th
127      75th
134      90th

a. What is the range of the distribution?
b. What is the median?
c. What are the 1st quartile, 2nd quartile, and 3rd quartile? Figure out: what is the interquartile range?
d. What is the interdecile range?

3. The data given are the calories per 200 milliliters of popular sodas.
21, 18, 21, 20, 26, 31, 18, 16, 25, 27, 13, 27, 36, 24, 25

a. Find the 25th percentile.
b. Find the median.
c. Find the 75th percentile.
d. Construct a box plot for these data.

Lesson 5.5 The Normal Distributions

Specific Objectives:
1. To understand the concept of normal distribution
2. To gain knowledge on how to use the z-table efficiently
3. To identify and classify some situations pertaining to normal distribution
4. To understand the applicability of normal distribution in real life

If the mean and standard deviation are the heart and brain of descriptive statistics, then perhaps the normal curve is its lifeblood. In the preceding lesson, we discussed z-scores in passing, wherein the mean is always zero and the standard deviation is fixed at 1. In this lesson, it is now proper to introduce the normal curve. The normal curve is actually a theoretical distribution.
It is a unimodal frequency distribution curve. The scores are laid out along the X axis, while the frequency of occurrence is given by the Y axis.

Discussions

Here are some key characteristics of the normal curve.
1. The majority of the scores cluster around the middle of the distribution, with fewer scores scattered at the two extremes or tail ends of the curve.
2. It is always symmetrical and perfectly balanced.
3. Being a theoretical distribution, its mean, median, and mode are all equal.
4. It uses standard deviation units along the x-axis.
5. The normal curve is asymptotic to the abscissa, and the total area under the curve is 1.0 or 100%.
6. When scores are expressed as z-scores (the standard normal curve), it has a mean of zero and a standard deviation of 1 unit.

The Empirical Rule for a normal distribution
About 68% of the data lie within 1 SD of the mean.
About 95% of the data lie within 2 SD of the mean.
About 99.7% of the data lie within 3 SD of the mean.

z Scores. The z-score is enormously helpful in interpreting the relative position of a raw score because it takes into account both the centrality of the distribution and the amount of variability. With the z-score, we can understand an individual's performance relative to the performance of the entire group being measured.

But before we delve deeper into the concepts of the z-score, it is imperative to learn how to use the z-score table. A copy of the z-table can be accessed at this website address: https://www.calculator.net/z-score-calculator.html

The table we will be using is a right tail z-table. It gives the area between z = 0 and any positive z value, that is, an area on the right-hand side of the mean of the normal curve. The z-score table gives only the percentage for the half of the curve.
But since the normal curve is symmetrical, a z-score to the right of the mean yields the same percentage as the same z-score to the left of the mean.

For example, to look up a z-score of 0.68 using the z-score table, look for 0.6 in the far-left column, then look for the second decimal, 0.08, in the top row. The table value is 0.25175. It represents a percentage of 25.175%. This is the percentage of cases falling between the z-score and the mean.

[Figure: normal curve with the area between the mean and the z-score of 0.68 shaded; the shaded area is 25.175%.]

Now, let us consider some situations that might occur in using the z-table.

Case 1. Finding the percentage of cases falling between the z-score and the mean.

[Figure: two normal curves. Left panel: the area between a negative z-score and the mean is 27.337%. Right panel: the area between the mean and a positive z-score is 27.337%.]

As an example for Case 1, the z-score of +0.75 has a z-table value of 0.27337 or 27.337%. In the same way, the z-score of -0.75 has the same z-table value of 0.27337 or 27.337%. Notice that the value is always a positive number, since a percentage area is always positive.

Case 2. Finding the percentage of cases above the given z-score.

For this case, it is important to remember that the total area under the normal curve is 1.0 or 100%. It is also essential to keep in mind that the right half of the normal curve is 50%, as is the left half (50%). You also need to consider that the z-table always provides a percentage value in relation to the mean.

[Figure: two normal curves. Panel (a): positive z-score of +0.75; the area above it is 22.663%. Panel (b): negative z-score of -0.75; the area between it and the mean is 27.337%, and the right half of the curve is 50%.]

For Case 2(a), to find the area above the given z-score, determine the equivalent z-table value and then subtract it from the total area of the right half, which is 50%. For example, consider finding the percentage of cases above the z-score of +0.75.
Find the z-table value of 0.75, which is 0.27337 (27.337%), then subtract it from the total area of the right half of the normal curve, which is 50%. This is 50% - 27.337% = 22.663%.

For Case 2(b), to determine the area above the given z-score (the z-score here is a negative number because it is situated on the left side of the normal curve), simply find the equivalent z-table value and then add 50%. Again, always keep in mind that the z-table only provides the percentage of cases between the z-score and the mean, not the entire right side of the curve. As another example, let us find the percentage of cases above the z-score of -0.75. The z-table value for 0.75 is 0.27337, which is equivalent to 27.337%. To this number, add the percentage area of the entire right side, which is 50%. So this is 27.337% + 50% = 77.337%.

Case 3. Finding the percentage of cases below the given z-score.

The principle used in Case 2 also applies in Case 3.

[Figure: two normal curves. Panel (a): negative z-score of -0.75; the area below it is 22.663%. Panel (b): positive z-score of +0.75; the left half of the curve is 50%, and the area between the mean and the z-score is 27.337%.]

For Case 3(a), try to determine the percentage of cases below the z-score of -0.75. Following the analysis made in Case 2(a), the z-table value must be subtracted from the total area of the left side (50%). If your computation is correct, your answer is 22.663%.

For Case 3(b), determine the percentage of cases below the z-score of +0.75. The z-table value covers only the percentage of cases between the z-score and the mean, so you need to add 50%, which is the percentage of cases on the left side of the normal curve. Your computation must generate an answer of 77.337%.

Case 4. Finding the percentage of cases between two z-scores.

[Figure: normal curve with the areas between -0.75 and the mean (27.337%) and between the mean and +0.75 (27.337%) shaded.]

To illustrate Case 4, let us determine the percentage of cases between two z-scores: -0.75 and +0.75.
The -0.75 z-score has a z-table value of 0.27337 (27.337%). The +0.75 z-score has the same z-table value of 27.337%. Thus, the percentage of cases between -0.75 and +0.75 is found by adding the two percentages: 27.337% + 27.337% = 54.674%.

Translating the raw score into the z-score.

We are now familiar with z-score concepts and know how to find the percentages of area above, below, and between z-scores. We also know from the z-score formula that if the mean and standard deviation are known, we can subtract the mean from the raw score, divide by the standard deviation, and obtain the z-score.

$$z = \frac{x - \bar{x}}{SD}$$

The z-score gives the location of the raw score relative to the mean in standard deviation units. It accounts for both the mean of the distribution and the amount of variability. Now, let us see the practical use of the z-score in the context of a normal distribution of raw scores.

Case A. When the percentage of cases is between the raw score and the mean.

A normal distribution of physics scores has a mean of 85 and a standard deviation of 10. What percentage of scores fall between the physics score of 95 and the mean?

Initially, we need to convert the raw score of 95 into its equivalent z-score.

$$z = \frac{x - \bar{x}}{SD} = \frac{95 - 85}{10} = 1.0$$

Then draw the normal curve as shown below.

[Figure: normal curve with the area between the mean (85) and the score of 95 (z = 1.00) shaded; the shaded area is 34.13%.]

Next, look up the z-score value in the table (https://www.calculator.net/z-score-calculator.html). The z-table value is 0.34134 or 34.13%. That is the percentage of scores falling between the physics score of 95 and the mean. This means that around 1 in 3 students (34.13%) fall between the score of 95 and the mean.

Case B. When the percentage of cases falls below a raw score.

Using the same example, a normal distribution of scores in a physics class with a mean of 85 and a standard deviation of 10, what percentage of physics scores fall below a score of 95?
First, convert the raw score of 95 into its equivalent z-score.

$$z = \frac{x - \bar{x}}{SD} = \frac{95 - 85}{10} = 1.0$$

Next, draw the normal curve as shown below.

[Figure: normal curve with the left half (50%) and the area between the mean (85) and the score of 95 (z = 1.00), 34.13%, shaded.]

Finally, look up the z-score in the z-table (https://www.calculator.net/z-score-calculator.html) and take the corresponding value. It is 0.34134 or 34.13%. Lastly, add 50% to 34.13% to get the sum, 84.13%. The percentage of physics scores that fall below a score of 95 is 84.13%. This means that if 100 students took the examination and your score is 95, then your physics grade surpassed the grades of about 84 students.

Case C. When the percentage of cases is above a raw score.

On a normal distribution of scores in a physics class, with a mean of 85 and a standard deviation of 10, what percentage of physics scores are above a score of 95?

Again, we need to convert the raw score of 95 into its equivalent z-score.

$$z = \frac{x - \bar{x}}{SD} = \frac{95 - 85}{10} = 1.0$$

Then draw the normal curve as shown below.

[Figure: normal curve with the area between the mean (85) and the score of 95 (z = 1.00) marked as 34.13% and the area above 95 marked as 15.87%.]

Look up the z-score in the table (https://www.calculator.net/z-score-calculator.html) and take the corresponding value. It is 0.34134 or 34.13%. Then subtract 34.13% from 50%. The answer is 15.87%. This is the percentage of cases above the score of 95. This means that if 100 students took the examination and your score is 95, then around 16 students surpassed your physics grade of 95.

Case D. When the percentage of cases is between raw scores.

On a normal distribution of physics scores, the mean is 85 and the standard deviation is 10. Your physics score is 95 and your friend's score is 80. You want to determine what percentage of students got a score between your friend's score of 80 and your score of 95.

Again, convert the raw score of 95 and the raw score of 80 into their equivalent z-scores.
$$z = \frac{x - \bar{x}}{SD} = \frac{95 - 85}{10} = 1.0 \qquad z = \frac{x - \bar{x}}{SD} = \frac{80 - 85}{10} = -0.5$$

Then draw the normal curve as shown below.

[Figure: normal curve with the area between the score of 80 (z = -0.5) and the mean (85) marked as 19.15% and the area between the mean and the score of 95 (z = 1.00) marked as 34.13%.]

Look up the z-scores in the table (https://www.calculator.net/z-score-calculator.html). Find the percentage of cases for the z-value 1.0 and the percentage of cases for the z-value -0.5. The percentages are 34.13% and 19.15% respectively. Add the two values to get the percentage of cases between the raw scores of 80 and 95. The answer is 53.28%. This means that about 1 in 2 students got a score between 80 and 95 (i.e., between your friend's score and your score).

At this point, we have made a significantly long journey: from the measures of central tendency to the measures of variability and finally to the measures of relative position. We are now in a position of no longer seeking answers to questions but seeking questions beyond the conventions established by the answers.

Learning Activity 5.5

1. Road tests of the MG5 Sedan compact car show a mean fuel rating of 20 kilometers per liter on highways, with a standard deviation of 1.5 kilometers per liter. What percentage of these cars (MG5) will achieve results of
a. More than 25 kilometers per liter?
b. Less than 17 kilometers per liter?
c. Between 15 and 24 kilometers per liter?
d. Between 21 and 24 kilometers per liter?

2. On a normal distribution, at what percentage must
a. The mean fall?
b. The median fall?
c. The mode fall?

Lesson 5.6 The Linear Correlation: Pearson r

Specific Objectives
1. To know the characteristics of Pearson r
2. To solve problems dealing with linear correlations
3. To understand the limitations of linear correlations

At the beginning of this course, we defined mathematics as the science of patterns.
We realized that nature follows a certain kind of mathematical structure as we observed patterns and irregularities. Whenever we see patterns, irregularities also beg to be noticed, and whenever we see irregularities, patterns suddenly wave for attention. Linear correlation is not about patterns as such. It is about looking at irregularities and patiently waiting for the patterns to manifest. This lesson deals with determining the connections between things that seem unrelated and with declaring whether some correlations are indeed significant.

Discussions

The Pearson r Linear Correlation. The Product-Moment Correlation Coefficient, or Pearson r, is a statistical tool that can determine the linear association between two distributions or groups. This tool can only establish the strength of the association or correlation. It can never justify a causal relation, however obvious one may seem.

The formula below is the computational method for calculating the Pearson r.

$$r = \frac{\frac{\Sigma XY}{N} - (\bar{x})(\bar{y})}{SD_x \, SD_y}$$

Here $N$ is the number of subjects, $\bar{x}$ and $\bar{y}$ are the means, and $SD_x$ and $SD_y$ are the standard deviations.

The Pearson r value may indicate three possible scenarios. If the value of $r$ is positive, there is a positive correlation. If it is negative, there is a negative correlation. If the value of $r$ is around 0, almost no linear correlation was found.

[Figure: three scatter plots illustrating $r = +1$, $r = -1$, and $r = 0$.]

An example of positive correlation is the height and weight of a person. Under normal circumstances, when a person gains height, the person also gains weight. An example of negative correlation is the relationship between length of employment and degree of attractiveness: the physical attractiveness of an employee may decline as his or her age advances. An example of zero correlation might be the relationship between the grades of students living in highland areas and the study habits of students living in lowland areas.
You should also remember that Pearson $r$ never takes a value less than -1 or more than +1. Any answer below -1 or above +1 can be attributed to an error in computation.

We will explain the nature of linear correlation by using an example. Assume that we want to determine if there is a correlation between hours of study and the grades of students last semester. Initially, we need to randomly select students (let us say 10) and ask them about their average grade last semester as well as the number of hours they spent studying per week in that semester. Let us presume that they provided us with this information right away.

===============================================
Student     Hours of Study (x)     Grade (y)
===============================================
A           15                     2.75
B           35                     1.25
C           05                     3.00
D           20                     2.50
E           30                     1.50
F           40                     1.00
G           20                     2.25
H           25                     1.75
I           25                     2.00
J           08                     3.00

But before we use the Pearson r formula, we need to ensure that it is the correct statistical tool for determining the correlation between hours of study and grades. Let us check some basic Pearson r requirements:
1. Random selection of participants.
2. The traits being measured must not depart significantly from normality.
3. The measurements on both distributions must be in the form of interval data.
4. Only two groups are being compared.
5. The goal is to determine the linear correlation between the two groups.

The formula for solving the Pearson $r$ is

$$r = \frac{\frac{\Sigma XY}{N} - (\bar{x})(\bar{y})}{SD_x \, SD_y}$$

$X$ refers to one variable and $Y$ refers to the other variable
$\bar{x}$ and $\bar{y}$ refer to the mean of $X$ and the mean of $Y$
$SD_x$ and $SD_y$ refer to the standard deviations of $X$ and $Y$ respectively
$N$ refers to the number of subjects (paired observations)
$\Sigma$ is the symbol for summation

Now let us take into account the data below as our example to illustrate the formula.
====================================================================
Student   Hours of Study (x)   x²      Grade (y)   y²      xy
====================================================================
A         15                   225     2.75        7.56    41.25
B         35                   1225    1.25        1.56    43.75
C         05                   25      3.00        9.00    15.00
D         20                   400     2.50        6.25    50.00
E         30                   900     1.50        2.25    45.00
F         40                   1600    1.00        1.00    40.00
G         20                   400     2.25        5.06    45.00
H         25                   625     1.75        3.06    43.75
I         25                   625     2.00        4.00    50.00
J         08                   64      3.00        9.00    24.00
====================================================================
          Σx = 223             Σx² = 6089   Σy = 21   Σy² = 48.75   Σxy = 397.75

$$\bar{x} = \frac{\Sigma x}{N} = \frac{223}{10} = 22.3 \qquad \bar{y} = \frac{\Sigma y}{N} = \frac{21}{10} = 2.1$$

$$SD_x = \sqrt{\frac{\Sigma x^2}{N} - \bar{x}^2} = \sqrt{\frac{6089}{10} - 22.3^2} = 10.56 \qquad SD_y = \sqrt{\frac{\Sigma y^2}{N} - \bar{y}^2} = \sqrt{\frac{48.75}{10} - 2.1^2} = 0.682$$

$$r = \frac{\frac{\Sigma XY}{N} - (\bar{x})(\bar{y})}{SD_x \, SD_y} = \frac{\frac{397.75}{10} - (22.3)(2.1)}{(10.56)(0.682)}$$

$$r = -0.979$$

Point to Ponder: Why do you think we generated a negative r value?

Thus, we can say that the correlation between hours of study and the grades of students achieved a Pearson r value of -0.979. Do not be confused by the negative sign in our final answer. This sign indicates the direction of the correlation line. Keep in mind that a grade of 1.0 carries strong academic weight in our grading system, but once plugged into the computation it is treated by the formula as a small number. Nevertheless, with full knowledge of the concept you can always arrive at the right interpretation. Since the analysis concerns only these 10 students and is not treated as a sample from a population, Guilford's suggested interpretation for the values of r can be used without hindrance.
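The hand computation above can be checked with a short Python sketch. It assumes the same computational formula used in this lesson, with the standard deviation computed by dividing by N. The function names `population_sd` and `pearson_r` are hypothetical helpers written for this illustration.

```python
import math

# Hours of study (x) and grades (y) for students A to J, from the table above.
hours = [15, 35, 5, 20, 30, 40, 20, 25, 25, 8]
grades = [2.75, 1.25, 3.00, 2.50, 1.50, 1.00, 2.25, 1.75, 2.00, 3.00]


def population_sd(values):
    """SD = sqrt(sum of squares / N - mean squared), as in this module."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum(v * v for v in values) / n - mean ** 2)


def pearson_r(xs, ys):
    """r = (sum(xy)/N - mean_x * mean_y) / (SD_x * SD_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy / n - mean_x * mean_y) / (population_sd(xs) * population_sd(ys))


r = pearson_r(hours, grades)
print(round(r, 3))  # -0.979, matching the hand computation
```

The sign is negative because, in this grading system, more hours of study go with numerically smaller (that is, better) grades.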
Guilford's Interpretation for the values of r

r value            Interpretation
===============================================================
Less than .20      Almost negligible relationship
.20-.40            Definite but small relationship
.40-.70            Substantial relationship
.70-.90            Marked relationship
.90-1.00           Very dependable relationship
===============================================================

Based on Guilford's suggested interpretation, and taking the size of r without its sign (0.979), there is a very dependable relationship between hours of study and the grades of students. Does it mean that better grades can be achieved by spending more time studying? Does it mean that spending more time studying is a by-product of better grades? Does it mean that another factor influenced both better grades and study habits? All three of these explanations are possible. But the point is that correlation alone is not enough to identify which is the real explanation. Pearson r is not a tool for establishing causation. It is only a tool for describing the linear correlation between two observed traits.

Learning Activity 5.6

Seven randomly selected participants were given both math and music tests. Their scores are as follows:

====================
Math      Music
====================
16        14
6         7
17        15
11        14
12        12
4         6
13        11

Is their math ability related to their music ability?

Lesson 5.7 The Least-Squares Regression Line

Specific Objectives:
1. Define linear regression
2. Define scatter plot
3. Compute the least-squares regression line

In the previous lesson, we discussed Pearson r as a powerful tool for determining linear correlation. It is an important tool for investigating associations, considering that different mathematical patterns are all around us. Examples include the connection between high and low tides and human behavior, and the association between height and weight.
Another is the correlation between the metaphoric flap of a butterfly's wings in Japan and a weather disturbance in South America a year later. But correlation confines us to merely associating. Correlation in and by itself cannot establish causation to warrant prediction. In this lesson on regression analysis, we can not only connect and associate observable patterns but also finally make basic predictions.

Discussion

Bivariate Scatter Plot

Bivariate simply means that we can graphically represent two variables (x and y) in a scatter plot, wherein each point represents a pair of scores. A scatter plot is necessary in order to determine the regression line. The regression line is a generated straight line that lies closest to all the points in the scatter plot.

Our example below illustrates the construction of a scatter plot based on the data from our previous example on hours of study and grade.

[Figure: scatter plot of hours of study against grade with the regression line drawn through the points; the vertical deviations from each point to the line are labeled $d_1$ through $d_{10}$.]

As shown in the scatter plot above, the straight line is called the least-squares regression line. This generated line minimizes the sum of the squares of the vertical deviations from each data point to the line. This means that of all the possible lines that could describe the trend of the points, this generated line has the best fit. Each $d_n$ represents the vertical distance from a point (x, y) to the line, and the quantity being minimized is

$$d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2 + d_6^2 + d_7^2 + d_8^2 + d_9^2 + d_{10}^2$$

In the least-squares line, the correlation that can be established around the regression line is the basis for the resulting prediction. But in order to make predictions, three important ingredients must be on hand:
1. The equation of the best fit line,
2. The slope of the line, and
3. The y-intercept of the line.
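The claim that the regression line minimizes the sum of squared vertical deviations can be checked numerically. The sketch below assumes the standard closed-form slope and intercept of a least-squares line and reuses the hours-of-study and grade data from the previous lesson. The helper names `least_squares_line` and `sum_squared_deviations`, and the sizes of the nudges, are hypothetical choices made for illustration.

```python
# Hours of study (x) and grades (y) for the ten students in Lesson 5.6.
hours = [15, 35, 5, 20, 30, 40, 20, 25, 25, 8]
grades = [2.75, 1.25, 3.00, 2.50, 1.50, 1.00, 2.25, 1.75, 2.00, 3.00]


def least_squares_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def sum_squared_deviations(xs, ys, m, b):
    """Add up the squared vertical distances d_n from each point to the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


m, b = least_squares_line(hours, grades)
best = sum_squared_deviations(hours, grades, m, b)

# Nudging the slope or the intercept away from the fitted values
# should only make the total squared deviation larger.
nudged_slope = sum_squared_deviations(hours, grades, m + 0.01, b)
nudged_intercept = sum_squared_deviations(hours, grades, m, b + 0.1)

print(round(m, 5), round(b, 4))
print(best < nudged_slope, best < nudged_intercept)  # True True
```

Any other line through the same points gives a larger total, which is what "best fit" means in the least-squares sense.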
The Formula for the Least-Squares Regression Line

There must be $n$ ordered pairs: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$.

\[ y = mx + b \]

\[ m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \qquad b = \frac{(\sum y) - m(\sum x)}{n} \]

To apply this formula to our given data, we need to find the value of each summation. For our example, $n = 10$, $\sum x = 223$, $\sum y = 21$, $\sum xy = 397.75$, and $\sum x^2 = 6089$.

In finding the value of $m$:

\[ m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{10(397.75) - (223)(21)}{10(6089) - 49729} = \frac{-705.5}{11161} = -0.06321 \]

In finding the value of $b$:

\[ b = \frac{(\sum y) - m(\sum x)}{n} = \frac{(21) - (-0.06321)(223)}{10} \approx 3.509 \]

Finally, substituting the values into the given formula:

\[ y_{pred} = mx + b \]
\[ y_{pred} = -0.06321x + 3.509 \]

Slope ($m$) = -0.06321
y-intercept ($b$) = 3.509

In the preceding lesson, we were able to establish the strength of correlation of this example using Pearson r. We found a very strong relationship between hours of study ($x$) and grade ($y$) (i.e., a coefficient of magnitude 0.979). Now let us predict the grades of students who spent the following weekly study hours: 37, 22, and 8. Since we have already determined the regression line, we simply substitute each value of $x$ and solve for $y$.

================================================================
$y_{pred} = mx + b$
$y_{pred} = -0.06321x + 3.509$
================================================================
$x$      $-0.06321x$      $-0.06321x + 3.509 = y_{pred}$
================================================================
37       -2.33877         1.17023
22       -1.39062         2.11838
8        -0.50568         3.00332
================================================================

The predicted grade is around 1.17 for the student who spends 37 hours studying, 2.12 for the one who spends 22 hours, and just a passing grade of 3.0 for the one who studies for eight hours.

Learning Activity 5.7

The research office is interested in the possible relationship between the anxiety and aptitude scores of eight randomly selected students, who were given both an anxiety test and an aptitude test.
Their weighted, scaled scores are as follows:

============================================
Subjects    Aptitude Test    Anxiety Test
============================================
A           10               12
B           7                9
C           13               14
D           8                7
E           11               11
F           6                7
G           10               12
H           11               10
============================================

a. What is the correlation between anxiety and aptitude scores?
b. If one of the students receives a score of 12 on the aptitude test, what is your best estimate of the score that student will receive on the anxiety test?

Module Five: Project Proposal Requirement

For this culminating requirement in Module Five, you need to work together in groups of 3 or 4.

1. Your task is to prepare a proposal for a study that can contribute to a solution to any social problem.
2. You must use statistical methods for your data processing and analyses.
3. Your final output must be no more than 8 pages detailing your project proposal.
4. Please follow the outline provided below:
   a. Title page (not included in the page count)
      - An example of a problem to be addressed: In this COVID-19 pandemic, how can we reduce human traffic in wet markets?
   b. Background and Statement of the Problem
   c. Literature Review
   d. Proposed Study, with emphasis on how statistics will be used
      i. Data to be collected
      ii. Methods of data collection and data gathering instrument
      iii. Data gathering procedure
      iv. Method of analyses
   e. Discussion of how your project proposal can address the identified problem
   f. References (APA or MLA)
5. Below is the format guideline:

============================================
Paper Substance (if printed)    20
All Margin                      Normal
Orientation                     Portrait
Paper Size                      8.5 x 13
Font Type                       Arial
Font Size                       12
Line Spacing                    1.5
Page Number                     Page x of x
Alignment                       Justified
============================================

Your project proposal will be graded based on these criteria:
1. Soundness of the proposal (1/3)
2. Appropriate use of statistical method (1/3)
3. Coherence (1/3)

Chapter Test 5

Multiple Choice.
Choose the letter of the correct answer and write it on the blank provided at the left side of the test paper.
==================================================================
__________ 1. It is a branch of statistics that deals with data analysis, and one of its techniques is to “describe” data in symbolic form and abbreviated fashion.
    a. Inferential Statistics         c. Descriptive Statistics
    b. Statistics and Probability     d. Probability
__________ 2. It is a branch of statistics that has the ability to “infer” and to generalize. It is also the right tool to predict values that are not really known.
    a. Inferential Statistics         c. Descriptive Statistics
    b. Statistics and Probability     d. Probability
__________ 3. It is essentially quantifying an observation according to a certain rule. It is also assigning numbers in a prescribed way.
    a. Variable                       c. Data
    b. Measurement                    d. Constant
__________ 4. If the data are labelled 1st, 2nd, 3rd, and so on, in what kind of scale do they fall?
    a. Nominal Scale                  c. Categorical Data
    b. Interval Scale                 d. Ordinal Scale
__________ 5. Personal biodata falls in what kind of scale?
    a. Nominal Scale