Descriptive Statistics Handout PDF - Statistical Methods, Data Collection | University of the Philippines Visayas

University of the Philippines Visayas
College of Arts and Sciences
DIVISION OF PHYSICAL SCIENCES AND MATHEMATICS

Unit 1 Descriptive Statistics

1.1 Basic Terms and Concepts

Fields of Statistics

1. Statistical Methods of Applied Statistics - refer to procedures and techniques used in the collection, presentation, analysis, and interpretation of data.

   Descriptive Statistics
   - methods concerned with the collection, description, and analysis of a set of data without drawing conclusions or inferences about a larger set
   - the main concern is simply to describe the set of data such that otherwise obscure information is brought out clearly
   - conclusions apply only to the data on hand

   Inferential Statistics
   - methods concerned with making predictions or inferences about a larger set of data using only the information gathered from a subset of this larger set
   - the main concern is not merely to describe but actually to predict and make inferences based on the information gathered
   - conclusions are applicable to a larger set of data of which the data on hand is only a subset

2. Statistical Theory of Mathematical Statistics - deals with the development and exposition of theories that serve as bases of statistical methods.

Population and Sample

Definition. A population is a collection of all the elements under consideration in a statistical study.

Definition. A sample is a part or subset of the population from which the information is collected.

Example: A manufacturer of kerosene heaters wants to determine if customers are satisfied with the performance of their heaters. Toward this goal, 5,000 of his 200,000 customers are contacted and each is asked, “Are you satisfied with the performance of the kerosene heater you purchased?” Identify the population and sample.
In this example, the population of interest consists of all customers who purchased the kerosene heater, while the sample consists of those who were contacted to give their level of satisfaction with the product they purchased. There are 200,000 customers in the population and 5,000 customers in the sample.

Note that the course pack provided to you in any form is intended only for your use in connection with the course that you are enrolled in. It is not for distribution or sale. Permission should be obtained from your instructor for any use other than for what it is intended.

Definition. A parameter is a numerical characteristic of the population.

Definition. A statistic is a numerical characteristic of the sample.

Example: In order to estimate the true proportion of students at a certain college who smoke cigarettes, the administration polled a sample of 200 students and determined that the proportion of students from the sample who smoke cigarettes is 0.12. Identify the parameter and the statistic.

The population consists of all students at the college of interest, while the sample consists of the 200 students who were asked whether they smoke cigarettes. The parameter in this case is the proportion of students in the college who smoke cigarettes, while the statistic is the proportion of students in the sample who smoke cigarettes. The value 0.12 is a realized value of the statistic.

Definition. A variable is a characteristic or attribute of persons or objects which can assume different values or labels for different persons or objects under consideration.

Definition. A measurement is the process of determining the value or label of a particular variable for a particular experimental or sampling unit.

Definition. An experimental unit or sampling unit is the individual or object on which a variable is measured.

Definition. An observation is a numerical recording of information on a variable.

Definition. Data is a collection of observations.
1.2 Data Collection

Classification of Statistical Data

1. Primary vs. Secondary
   - Primary data - data measured by the researcher/agency that published it
   - Secondary data - any republication of data by another agency

   Example: The publications of the Philippine Statistics Authority are primary data and all subsequent publications of other agencies are secondary.

2. External vs. Internal
   - Internal data - information that relates to the operations and functions of the organization collecting the data
   - External data - information that relates to some activity outside the organization collecting the data

   Example: The sales data of SM is internal data for SM but external data for any other organization such as Robinson’s.

Stat 111 Handout 1

Sources of Statistical Data

General Classification of Collecting Data

Definition. Census or complete enumeration is the method of gathering the information of interest or pertinent data from every unit in the population.
- not always possible to get timely, accurate and economical data
- costly, especially if the number of units in the population is too large

Definition. Sample survey is the method of gathering data from a small but representative cross-section of the population.

Advantages of a Sample Survey over a Census (on a large population)
- Speed and timeliness - data on the population can be gathered faster, ensuring uniformity
- Economy - information gathering and data analysis are cheaper
- Quality and accuracy - when properly conducted, a sample survey usually yields more accurate results since a small, highly skilled group of workers is likely to make fewer errors in the collection and handling of data than a large census force would
- Feasibility - some data gathering methods require the destruction of a unit to obtain the required information, e.g. lifetime of a bulb

Methods of Collecting Statistical Data

1. Survey method - questions are asked to obtain information. There are several ways of administering a survey, including:
   a. Telephone interview, of three types:
      - traditional telephone interviews
      - computer assisted telephone dialing
      - computer assisted telephone interviewing (CATI)
   b. Mailed questionnaire
   c. Online survey
   d. Personal in-home survey
   e. Personal mall intercept survey
2. Observation method - makes possible the recording of behavior but only at the time of occurrence (e.g., observing reactions to a particular stimulus, traffic count)
3. Experimental method - a method designed for collecting data under controlled conditions
4. Use of existing studies - e.g., census, health statistics, and weather bureau reports. Two types:
   - documentary sources - published or written reports, periodicals, unpublished documents, etc.
   - field sources - researchers who have done studies on the area of interest are asked personally or directly for the information needed
5. Registration method - e.g., car registration, student registration, and hospital admission

Sampling and Sampling Techniques

Definition. A sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample is called probability sampling. Otherwise, the sampling procedure is called non-probability sampling.

Remark: Whenever possible, probability sampling is used because there is no objective way of assessing the reliability of inferences under non-probability sampling.

Definition. The target population is the population from which information is desired.

Definition. The sampled population is the collection of elements from which the sample is actually taken.

Definition. The population frame is a listing of all the individual units in the population.

Non-probability Sampling Methods
1. purposive sampling - sets out to make a sample agree with the profile of the population based on some pre-selected characteristics
2. quota sampling - selects a specified number (quota) of sampling units possessing certain characteristics
3. convenience sampling - selects sampling units that come to hand or are convenient to get information from
4. judgment sampling - selects a sample in accordance with an expert’s judgment
5. snowball/referral sampling - the researcher asks the initial subject to identify another potential subject who also meets the criteria of the research

Probability Sampling Methods
1. Simple random sampling
2. Stratified random sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling

1. Simple Random Sampling

Simple random sampling (SRS) is a sampling technique wherein every item of the population has an equal chance of being selected in the sample. Random sampling may be with replacement (SRSWR) or without replacement (SRSWOR). In SRSWR, a chosen element is always replaced before the next selection is made, so that an element may be chosen more than once.

Approaches:
- Method of lottery
- Use of random numbers
  - Table of random numbers
  - Online random number generator

2. Stratified Random Sampling

In stratified random sampling, the population of N units is first divided into subpopulations called strata. Then a simple random sample is drawn from each stratum, the selection being made independently in different strata.

3. (1-in-k) Systematic Sampling

Systematic sampling with a “random start” is a method of selecting a sample by taking every kth unit from an ordered population, the first unit being selected at random. Here k is called the sampling interval and k = N/n; the reciprocal 1/k is the sampling fraction.

4. Cluster Sampling

Cluster sampling is a method of sampling where a sample of distinct groups, or clusters, of elements is selected and then a census of every element in the selected clusters is taken. Similar to strata in stratified sampling, clusters are non-overlapping sub-populations which together comprise the entire population.
Unlike strata, however, clusters are preferably formed with heterogeneous, rather than homogeneous, elements so that each cluster will be typical of the population. Clusters may be of equal or unequal size.

5. Multistage Sampling

In multistage sampling, the population is divided into a hierarchy of sampling units corresponding to the different sampling stages. In the first stage of sampling, the population is divided into primary stage units (PSU), then a sample of PSUs is drawn. In the second stage of sampling, each selected PSU is subdivided into second-stage units (SSU), then a sample of SSUs is drawn. The process of subsampling can be carried to a third stage, fourth stage and so on, by sampling the subunits instead of enumerating them completely at each stage.

Types of Allocation

Equal Allocation - used
- when the total number of units in each stratum is more or less the same
- when the stratum variability and cost per sampling unit do not vary much from stratum to stratum
- when there is no prior knowledge of stratum variability or cost per sampling unit

Proportional Allocation - used when the stratum sizes vary from stratum to stratum

Optimum (Neyman) Allocation - used when the stratum variability or stratum proportion is expected to vary from stratum to stratum

Allocation Method and Allocation Formula

Equal Allocation:

$$n_h = \frac{n}{g}; \quad h = 1, 2, \ldots, g$$

where g = no. of strata, n = sample size, and $n_h$ = hth stratum sample size.

Proportional Allocation:

$$n_h = \frac{N_h}{N} \cdot n; \quad h = 1, 2, \ldots, g$$

where $N_h$ = hth stratum population size, N = population size, and n = sample size.

Neyman Allocation:

$$n_h = n \cdot \frac{N_h s_h}{\sum_{h=1}^{g} N_h s_h}; \quad h = 1, 2, \ldots, g$$

where $N_h$ = hth stratum population size, n = sample size, and $s_h$ = hth stratum standard deviation.

Note: $n = \sum_{h=1}^{g} n_h$

1.3 Data Presentation

Textual Presentation
- data incorporated into a paragraph of text

Example: “Among the country’s 18 administrative regions, the most densely populated was the National Capital Region (NCR), with a population density of 20,785 persons per square kilometer. This figure is more than 60 times higher than the population density of 337 persons per square kilometer at the national level. This translates to an additional 1,648 persons per square kilometer (8.6 percent) from the 19,137 persons per square kilometer in 2010. The population density of the NCR in 2000 was 16,032 persons per square kilometer.”

Tabular Presentation
- the systematic organization of data in rows and columns

Parts of a Formal Statistical Table
1. Heading - consists of a table number, title, and headnote. The title is a brief statement of the nature, classification and time reference of the information presented and the area to which the statistics refer. The headnote is a statement enclosed in brackets between the table title and the top rule of the table that provides additional title information.
2. Box Head - the portion of the table that contains the column heads which describe the data in each column, together with the needed classifying and qualifying spanner heads.
3. Stub - the portion of the table usually comprising the first column on the left, in which the stubhead and row captions, together with the needed classifying and qualifying center head and subheads, are located. The stubhead describes the stub listing as a whole in terms of the classification presented. The row caption is a descriptive title of the data on the given line.
4. Field - main part of the table; contains the substance or the figures of one’s data
5. Source note - an exact citation of the source of data presented in the table (should always be placed when the figures are not original)
6. Footnote - any statement or note inserted at the bottom of the table

Example:

Table 4.4 – CRIME VOLUME AND RATE BY TYPE: 1991 – 1993
(Rate per 100,000 population)

| Type | 1991 Volume | 1991 Crime Rate | 1992 Volume | 1992 Crime Rate | 1993 Volume | 1993 Crime Rate |
|---|---|---|---|---|---|---|
| Total | 121,326 | 195 | 104,719 | 164 | 96,686 | 148 |
| Index Crimes | 77,261 | 124 | 67,354 | 106 | 58,684 | 90 |
| Murder | 8,707 | 14 | 8,293 | 13 | 7,758 | 12 |
| Homicide | 8,069 | 13 | 7,912 | 12 | 7,123 | 11 |
| Physical Injury | 21,862 | 35 | 20,462 | 32 | 18,722 | 29 |
| Robbery | 13,817 | 22 | 11,164 | 18 | 9,856 | 15 |
| Theft | 22,780 | 37 | 17,374 | 27 | 12,940 | 20 |
| Rape | 2,026 | 3 | 2,149 | 3 | 2,285 | 4 |
| Nonindex Crimes | 44,065 | 71 | 37,365 | 59 | 38,002 | 58 |

Source: Philippine National Police

Graphical Presentation
- a graph or chart is a device for showing numerical values or relationships in pictorial form

Example: Line Graph

[Line graph: MARKET SHARES OF LEADING SOFTDRINKS IN METRO MANILA: 1989 - 1995. Horizontal axis: year (1989 to 1995). Vertical axis: % shares (0 to 50). Series: Coca-Cola and Pepsi.]

1.4 Data Description

When we describe a set of data, we try to say neither too little nor too much. Statistical descriptions can be brief or elaborate, depending on the purposes they are to serve. Sometimes we present data in raw form and let them speak for themselves. On other occasions we present data as frequency distributions or as graphs. Most of the time, however, we must describe data by one or two carefully chosen numbers. It is often necessary to summarize data by means of a single number which, in its way, is descriptive of the entire set. Exactly what sort of number we choose depends on the particular characteristic we want to describe.
In one study we may be interested in the value which is exceeded by only 25 percent of the data; in another, in the value which exceeds the lowest 10 percent of the data; and in still another, in a value which somehow describes the center or middle of the data. The statistical measures which describe such characteristics are called measures of location; those that describe the center or middle of the data are called measures of central location or central tendency. There are three commonly used measures of central tendency: the mean, the median, and the mode.

Measures of Central Tendency

Definition. A measure of central tendency is any single value that is used to identify the “center” or the typical value of a data set. It provides a summary of the data which facilitates comparison of two or more data sets. It is often referred to as the average.

Characteristics of a Good Average
1. Easily understood - not a distant mathematical abstraction
2. Objective and rigidly defined - should encounter no question as to what the value is
3. Stable - not affected materially by minor variations in the groups of items
4. Easily amenable to further statistical computation

The Arithmetic Mean

Definition. The arithmetic mean (or simply the mean) of a data set is the sum of all values of the observations divided by the number of observations.
To find the mean of a data set, use one of the following formulas:

Population Mean: $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$

Sample Mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$

Characteristics of the Mean
- Employs all available information
- Strongly influenced by extreme values (outliers)
- May give a distorted picture if the number of observations is small
- May not be an actual number in the data set
- Possesses two mathematical properties that will prove to be important in the subsequent analyses:
  (a) the sum of the deviations of the values from the mean is zero
  (b) the sum of the squared deviations is minimum when the deviations are taken from the mean
- Always exists and is unique
- If a constant c is added to (subtracted from) all observations, the mean of the new observations will increase (decrease) by the same amount c
- If all observations are multiplied or divided by a constant, the new observations will have a mean that is the same constant multiple of the original mean

Some Modifications of the Mean:

a. Weighted Mean

Definition. The weighted mean is a modification of the usual mean that assigns weights (or measures of relative importance) to the observations to be averaged. If each observation $X_i$ is assigned a weight $W_i$, i = 1, 2, …, n, the weighted mean is given by

$$\bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}.$$

b. Combined Mean

Definition. Suppose that k finite populations having $N_1, N_2, \ldots, N_k$ measurements, respectively, have means $\mu_1, \mu_2, \ldots, \mu_k$. Then the combined population mean $\mu_c$, if we combine the measurements of all the populations, is

$$\mu_c = \frac{\sum_{i=1}^{k} N_i \mu_i}{\sum_{i=1}^{k} N_i}.$$

Consequently, if k samples of size $n_1, n_2, \ldots, n_k$ have sample means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$, respectively, then the combined sample mean $\bar{X}_c$, if we combine all the sample data, is

$$\bar{X}_c = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{\sum_{i=1}^{k} n_i}.$$

c. Trimmed Mean

We have noted that the mean may not be a good measure of central tendency in the presence of outlying observations. To mitigate the effects of these outliers, an $\alpha\%$-trimmed mean can be computed.
To find the $\alpha\%$-trimmed mean of a data set, order the data, delete the lowest $\alpha\%$ and the highest $\alpha\%$ of these observations, and find the arithmetic mean of the remaining entries.

Definition. The $\alpha\%$-trimmed mean is the arithmetic mean of the remaining entries after deleting the lowest $\alpha\%$ and the highest $\alpha\%$ of the ordered observations in a data set.

The Median

Definition. The median, $Md$ or $\tilde{X}$, is the value that divides the array into two equal parts. If the number of observations, n, is odd, then the median is given by

$$Md = X_{\left(\frac{n+1}{2}\right)},$$

the $\left(\frac{n+1}{2}\right)$th observation in the array. If n is even, then the median is the average of the two middle values in the array, i.e.,

$$Md = \frac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right).$$

Characteristics of the Median
- The median is a positional measure.
- The median is affected by the position of each item in the series but not by the value of each item. This means that extreme values affect the median less than the arithmetic mean.

The Mode

Definition. The mode, $Mo$ or $\hat{X}$, is the observed value that occurs most frequently in a data set. It locates the point where the observation values occur with the greatest density, which is determined by counting the frequency of each value and finding the value with the highest frequency of occurrence.

Characteristics of the Mode
- It does not always exist; and if it does, it may not be unique. A data set is said to be unimodal if there is only one mode, bimodal if there are two modes, trimodal if there are three modes, and so on.
- It is not affected by extreme values.
- The mode can be used for qualitative as well as quantitative data.

Choosing the Most Suitable Measure of Central Tendency

Choosing the most representative measure is not always simple since each one of these measures has its own advantages and disadvantages. The choice usually depends on the type of data distribution and the rationale for describing the data. The table below shows a summary of these criteria.
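The definitions above (mean, weighted mean, trimmed mean, median, and mode) can be sketched in Python using only the standard `statistics` module. This is a minimal illustration, not part of the handout: the helper names `weighted_mean` and `trimmed_mean` are hypothetical, and the rule of rounding the number of trimmed observations down is an assumption, since the handout does not say how to handle a fractional count. The data are the power-failure durations and savings balances from this handout's examples.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of W_i * X_i divided by the sum of W_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, alpha_percent):
    """Mean after deleting the lowest and highest alpha_percent of the ordered data.

    Assumption: the number of observations deleted from each end is rounded down.
    """
    ordered = sorted(values)
    k = int(len(ordered) * alpha_percent / 100)  # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Durations (in minutes) of power failures, from the handout's first example.
durations = [18, 26, 45, 75, 125, 80, 33, 40, 44, 49,
             89, 80, 96, 125, 12, 61, 31, 63, 103, 28]

print(mean(durations))               # 61.15
print(median(durations))             # 55.0 (average of the 10th and 11th ordered values)
print(sorted(multimode(durations)))  # [80, 125], so the data set is bimodal
print(trimmed_mean(durations, 10))   # 58.9375 (drops 12, 18 and both 125s)

# Savings balances weighted by the number of days each balance was held.
balances = [5200, 10430, 2650]
days = [24, 2, 4]
print(round(weighted_mean(balances, days), 2))  # 5208.67
```

The same `weighted_mean` helper also gives a combined mean when the values are group means and the weights are the group sizes. Note that `trimmed_mean` raises an error if the trimming percentage removes every observation.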
| Criteria | Mean | Median | Mode |
|---|---|---|---|
| Definition | Center of mass | Center of the array | Typical value |
| Data requirement | At least interval scale and values that are close to each other | At least interval scale | Even if nominal scale only |
| Existence / Uniqueness | Always exists / always unique | Always exists / always unique | Might not exist / not always unique |
| Considers every value? | Yes | No | No |
| Affected by outliers? | Yes* | No | No |

* except possibly for trimmed means

Examples: For the following problems, calculate the descriptive statistics needed.

1. The durations (in minutes) of power failures at a certain municipality in the last 10 years are given below.

   18 26 45 75 125 80 33 40 44 49
   89 80 96 125 12 61 31 63 103 28

   Compute and interpret the mean, median, and the mode of the data. What is the most appropriate measure of central tendency to use?

2. For the month of April, a savings account has a balance of Php 5,200 for 24 days, Php 10,430 for 2 days, and Php 2,650 for 4 days. What is the account’s mean daily balance for April?

3. Suppose the number of beds in a sample of 10 private hospitals and 14 public hospitals were collected and given below.

   Private: 149 167 162 127 180 180 160 167 221 145
   Public: 137 194 207 150 254 262 244 297 137 204 166 174 180 151

   a. Compute and interpret the mean, median, and the mode number of beds in the sample of private hospitals. Do the same for public hospitals.
   b. Compare private and public hospitals in terms of their average number of beds. Which measure of central tendency is appropriate to use?
   c. Combining private and public hospitals, what is the combined mean, median, and mode? Interpret these statistics.
   d. Suppose 50 beds were donated to each of the public hospitals, what is the new mean, median, and mode? What happens to the over-all mean, median, and mode?

Solutions:

1. Let X = duration (in minutes) of power interruption. The mean ($\bar{X}$), median ($Md$), and mode ($Mo$) are given by:

$$\bar{X} = \frac{\sum_{i=1}^{20} X_i}{20} = \frac{18 + 26 + 45 + \cdots + 103 + 28}{20} = \frac{1223}{20} = 61.15 \text{ min}$$

$$Md = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} = \frac{X_{(10)} + X_{(11)}}{2} = \frac{49 + 61}{2} = 55 \text{ min}$$

$$Mo = 80 \text{ and } 125 \text{ (bimodal)}$$

   For this data, the mean or median may appropriately be used since both satisfy the required nature of the variable.

2. Let $X$ be the savings account balance and $W$ be the number of days of a unique balance value. Then we can write:

| $i$ | $X_i$ | $W_i$ |
|---|---|---|
| 1 | 5200 | 24 |
| 2 | 10430 | 2 |
| 3 | 2650 | 4 |

   Using $W$ as weight to compute the mean daily balance for the month of April, we can solve for the weighted mean:

$$\bar{X} = \frac{\sum_{i=1}^{3} X_i W_i}{\sum_{i=1}^{3} W_i} = \frac{X_1 W_1 + X_2 W_2 + X_3 W_3}{W_1 + W_2 + W_3} = \frac{(5200)(24) + (10430)(2) + (2650)(4)}{24 + 2 + 4} = \text{Php } 5208.67$$

   The mean daily balance for the month of April is Php 5,208.67.

3. (This is left as an individual exercise.)

Measures of Dispersion

While measures of central tendency are useful to understand the typical values of a data set, measures of dispersion are important to describe the scatter of the data or, equivalently, data variability with respect to the central tendency. Two distinct sets of data may have the same mean or median, but different levels of variability, or vice-versa. A proper description of a data set should always include both of these characteristics. There are various measures of dispersion, each with its own set of advantages and disadvantages.

Definition. Measures of dispersion are quantities that indicate the extent to which individual items in a series are scattered about an average.

Some Uses for Measuring Dispersion
- to determine the extent of the scatter so that steps may be taken to control the existing variation
- used as a measure of reliability of the average value

General Classifications of Measures of Dispersion
1. Measures of Absolute Dispersion
   - expressed in the units of the original observations.
   - cannot be used to compare variations of two data sets when the averages of these data sets differ a lot in value or when the observations differ in units of measurement.
2. Measures of Relative Dispersion
   - unitless and are used when one wishes to compare the scatter of one distribution with another distribution.

Measures of Absolute Dispersion

The Range

Definition. The range of a set of measurements is the difference between the largest and the smallest values. That is, Range = maximum – minimum.

Characteristics of the Range
- It uses only the extreme values. It fails to communicate any information about the clustering or the lack of clustering of the values between the extremes.
- An outlier can greatly alter its value.

The Inter-quartile Range

Definition. The inter-quartile range of a set of measurements is the difference between the upper and lower quartiles. That is, IQR = Q3 – Q1.

Characteristics of the Inter-quartile Range
- It uses the quartile values instead of the minimum and maximum values. Thus, it reduces the influence of extreme values.
- It contains the middle 50% of the data set.

The Standard Deviation and the Variance

Definition. The variance is the average squared difference of each observation from the mean, while the standard deviation is just the positive square root of the variance.

For a finite population of size N, the population variance and standard deviation are as follows:

Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

SD: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

For a sample of size n, the sample variance and standard deviation are

Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$

SD: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$

Characteristics of the Variance and Standard Deviation
- The standard deviation is the most frequently used measure of dispersion.
- The variance is not, strictly speaking, a measure of absolute dispersion in the sense above, because it is expressed in squared units rather than in the units of the original observations.
- It is affected by the value of every observation.
  It may be distorted by a few extreme values.
- If each observation of a set of data is transformed to a new set by the addition (or subtraction) of a constant c, the standard deviation of the new set of data is the same as the standard deviation of the original data set.
- If a set of data is transformed to a new set by multiplying (or dividing) each observation by a constant c, the standard deviation of the new data set is equal to the standard deviation of the original data set multiplied (or divided) by c (more precisely, by |c|, since a standard deviation is never negative).

Measures of Relative Dispersion

The Coefficient of Variation

Definition. The coefficient of variation, CV, is the ratio of the standard deviation to the mean and is usually expressed in percentage. It is computed as

Population: $CV = \dfrac{\sigma}{\mu} \times 100\%$

Sample: $CV = \dfrac{s}{\bar{X}} \times 100\%$

Characteristics of the Coefficient of Variation
- It can be used to compare the variability of two or more data sets even if they have different means or different units of measurement.
- It expresses the standard deviation as a percentage of the mean.
- A large value of CV indicates that the data set is highly variable.
- It cannot be computed when the mean is zero and is meaningless when the mean is negative.

The Standard Score

Definition. The standard score or Z-score measures how many standard deviations an observation is above or below the mean. It is computed as

Population: $Z = \dfrac{X - \mu}{\sigma}$

Sample: $Z = \dfrac{X - \bar{X}}{s}$

Characteristics of the Standard Score
- The standard score is not a measure of relative dispersion per se but is somewhat related.
- It is useful for comparing two values from different series, especially when the two series differ in mean or standard deviation, or are expressed in different units.
- It can also be used to identify possible outliers in the data set.

Examples:

1. Sample annual salaries (in ten thousand pesos) for employees at a company are listed below.

   42 36 48 51 39 39 42 36 48 33 39 42 45 40 35 49

   a. Find the sample mean and standard deviation.
   b. Each employee in the sample is given a 5% raise. Find the sample mean and sample standard deviation for the revised data set.
   c. To calculate the monthly salary, divide each original salary by 12. Find the sample mean and sample standard deviation for the revised data set.
   d. Each employee in the sample is given a Php 10,000 raise. Find the new sample mean and sample standard deviation.

2. The foreign exchange rate is an indicator of the stability of the peso and is also an indicator of economic performance. In 1992, Bangko Sentral ng Pilipinas (BSP) put the peso on a floating rate basis. Market forces, and not government policy, have determined the level of the peso since. Government intervenes through the BSP only when there are speculative elements in the market. Given below are the means and standard deviations of the quarterly P-$ exchange rate for the periods 1989 to 1991 and 1992 to 1994. Which of the two periods is more stable?

| Period | Mean | SD |
|---|---|---|
| 1989 – 1991 | 22.4 | 1.84 |
| 1992 – 1994 | 26.4 | 1.15 |

3. Two of the quality criteria in processing butter cookies are the weight and color development in the final stages of oven browning. Individual pieces of cookies are scanned by a spectrophotometer calibrated to reflect yellow-brown light. The readout is expressed in per cent of a standard yellow-brown reference plate and a value of 41 is considered optimal (golden-yellow). The cookies were also weighed in grams at this stage. The means and standard deviations of 30 sample cookies are presented below.

| Criteria | Mean | SD |
|---|---|---|
| Color | 41.1 | 10.0 |
| Weight | 17.7 | 3.2 |

   Which of the two quality criteria is more varied?

4. Different typing skills are required for secretaries depending on whether one is working in a law office, an accounting firm, or for a mathematical research group at a major university. In order to evaluate candidates for these positions, an agency administers 3 distinct standardized typing samples.
   A time penalty has been incorporated into the scoring of each sample based on the number of typing errors. The mean and standard deviation for each test, together with the scores achieved by Nancy, an applicant, are given in the following table. Where do you think Nancy should be placed?

| Sample | Nancy’s Score | Mean | SD |
|---|---|---|---|
| Law | 141 sec | 180 sec | 30 sec |
| Accounting | 7 min | 10 min | 2 min |
| Scientific | 33 min | 26 min | 5 min |

Solutions:

1. Let $X$ be the annual salaries (in ten thousand pesos) for employees at a company.

   a. The mean and standard deviation are given by:

$$\bar{X} = \frac{\sum_{i=1}^{16} X_i}{16} = 41.5$$

$$s = \sqrt{\frac{\sum_{i=1}^{16} (X_i - \bar{X})^2}{16 - 1}} = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_{16} - \bar{X})^2}{15}} = \sqrt{\frac{(42 - 41.5)^2 + (36 - 41.5)^2 + \cdots + (49 - 41.5)^2}{15}} = 5.42$$

   b. The updated salaries can be written as $X_i^{new} = 1.05 X_i$, where $X_i$ is the old salary for employee $i$. Thus, we can use the properties of the mean and standard deviation with $c = 1.05$ and compute the new measures by:

$$\bar{X}_{new} = c\,\bar{X} = 1.05(41.5) \approx 43.6$$

$$s_{new} = c\,s = 1.05(5.42) \approx 5.69$$

   c. Similar to (b), we can write the new data as $X_i^{new} = X_i / 12$, where $X_i$ is the old salary. With $c = 12$, the new mean and standard deviation are given by:

$$\bar{X}_{new} = \bar{X}/c = 41.5/12 \approx 3.5$$

$$s_{new} = s/c = 5.42/12 \approx 0.45$$

   d. Because the salaries are recorded in ten thousand pesos, a raise of Php 10,000 corresponds to adding $c = 1$ to each observation, so the new data can be written as $X_i^{new} = X_i + 1$, where $X_i$ is the old salary for employee $i$. Using the properties of the mean and standard deviation, the updated values are:

$$\bar{X}_{new} = \bar{X} + c = 41.5 + 1 = 42.5$$

$$s_{new} = s = 5.42$$

2. Given below are the mean and standard deviation of the peso-dollar exchange rate for the two periods.

| Period | Mean | SD |
|---|---|---|
| 1989 – 1991 | 22.4 | 1.84 |
| 1992 – 1994 | 26.4 | 1.15 |

   Let $\bar{X}_1$ and $s_1$ be the mean and standard deviation, respectively, of the 1989-1991 period. Similarly, let $\bar{X}_2$ and $s_2$ be the mean and standard deviation, respectively, of the 1992-1994 period.
   The coefficient of variation is given by:

$$CV_1 = \frac{s_1}{\bar{X}_1} \times 100 = \frac{1.84}{22.4} \times 100 = 8.21$$

$$CV_2 = \frac{s_2}{\bar{X}_2} \times 100 = \frac{1.15}{26.4} \times 100 = 4.36$$

   The peso-dollar exchange rate is more stable in the 1992-1994 period than in the 1989-1991 period, as indicated by the lower CV value (4.36 < 8.21).

3. Given the mean and standard deviation of the measures for the two criteria, the coefficient of variation (CV) is computed as follows:

| Criteria | Mean | SD |
|---|---|---|
| Color | 41.1 | 10.0 |
| Weight | 17.7 | 3.2 |

$$CV_{color} = \frac{s_{color}}{\bar{X}_{color}} \times 100 = \frac{10.0}{41.1} \times 100 = 24.33$$

$$CV_{weight} = \frac{s_{weight}}{\bar{X}_{weight}} \times 100 = \frac{3.2}{17.7} \times 100 = 18.08$$

   Color is more variable than weight in the manufacturing process of the cookies, since 24.33 > 18.08.

4. To measure the performance of Nancy relative to other applicants, her standardized (Z) score is computed for each test. Let $Z_{law}$, $Z_{acc}$, and $Z_{sci}$ refer to her Z-scores in the Law, Accounting, and Scientific typing tests, respectively.

$$Z_{law} = \frac{X_{law} - \bar{X}_{law}}{s_{law}} = \frac{141 - 180}{30} = -1.3$$

$$Z_{acc} = \frac{X_{acc} - \bar{X}_{acc}}{s_{acc}} = \frac{7 - 10}{2} = -1.5$$

$$Z_{sci} = \frac{X_{sci} - \bar{X}_{sci}}{s_{sci}} = \frac{33 - 26}{5} = 1.4$$

   Since the scores are times, a lower Z-score indicates that Nancy performed faster compared to other applicants. The lowest among the three is $Z_{acc} = -1.5$. This means that Nancy should be assigned to the Accounting firm.
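The dispersion computations in the solutions above can be checked with a short Python sketch built on the standard `statistics` module. It is an illustration only: the helper names `coefficient_of_variation` and `z_score` are hypothetical, and the shift constant of 1 is an arbitrary choice used to show that adding a constant leaves the standard deviation unchanged.

```python
from statistics import mean, stdev


def coefficient_of_variation(sd, avg):
    """CV in percent: the standard deviation as a percentage of the mean."""
    return sd / avg * 100


def z_score(x, avg, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - avg) / sd


# Annual salaries in ten thousand pesos, from the handout's example.
salaries = [42, 36, 48, 51, 39, 39, 42, 36, 48, 33, 39, 42, 45, 40, 35, 49]
x_bar = mean(salaries)
s = stdev(salaries)  # sample standard deviation (divisor n - 1)
print(x_bar, round(s, 2))  # 41.5 5.42

# Multiplying every value by a constant scales both the mean and the SD.
raised = [1.05 * x for x in salaries]
print(round(stdev(raised), 2))  # 5.69

# Adding a constant shifts the mean but leaves the SD unchanged.
shift = 1  # arbitrary constant chosen for illustration
shifted = [x + shift for x in salaries]
print(mean(shifted), round(stdev(shifted), 2))  # 42.5 5.42

# Coefficient of variation for the two exchange-rate periods.
print(round(coefficient_of_variation(1.84, 22.4), 2))  # 8.21
print(round(coefficient_of_variation(1.15, 26.4), 2))  # 4.36

# Nancy's Z-scores; the scores are times, so lower means faster.
z = {
    "law": z_score(141, 180, 30),
    "accounting": z_score(7, 10, 2),
    "scientific": z_score(33, 26, 5),
}
print({name: round(value, 1) for name, value in z.items()})
print(min(z, key=z.get))  # accounting
```

`stdev` uses the divisor n - 1, matching the sample formula in this handout; `statistics.pstdev` would give the population version with divisor N.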
