University of St. La Salle BSTAT Handouts PDF
University of St. La Salle
S. R. LEONARES, PHD
Summary
This document is a set of handouts for the first semester of BSTAT at the University of St. La Salle. It covers topics related to statistics in the research process and provides reasons for studying statistics. The handouts include information about conducting research, reading journals, developing critical thinking skills, and being an informed consumer of statistical information.
UNIVERSITY OF ST. LA SALLE
College of Business and Accountancy
BSTAT – BUSINESS STATISTICS
First Semester, AY 2020 – 2021

HANDOUTS 1
STATISTICS IN THE RESEARCH PROCESS

"Statistics can be fun or at least they don't need to be feared." Many folks have trouble believing this premise. Often, individuals walk into their first statistics class experiencing emotions ranging from slight anxiety to borderline panic. It is important to remember, however, that the basic mathematical concepts that are required to understand introductory statistics are not prohibitive for any university student. The key to doing well in any statistics course can be summarized by two words, "KEEP UP!" If you do not understand a concept, reread the material, do the practice questions, and do not be afraid to ask your professor for clarification or help. This is important because the material discussed four weeks from today will be based on material discussed today. If you keep on top of the material and relax a little bit, you might even find you enjoy this introduction to basic measurements and statistics.

Why Study Statistics?

"Why do I need to learn statistics?" or "What future benefit can I get from a statistics class?" There are five primary reasons to study statistics:

The first reason is to be able to effectively conduct research. Without the use of statistics it would be very difficult to make decisions based on data collected from a research project. Statistics provides us with a tool with which to make an educated decision.

A second point about research should be made. It is extremely important for a researcher to know what statistics they want to use before they collect their data. Otherwise, data might be collected that cannot be interpreted. Unfortunately, when this happens it results in a loss of data, time, and money.

Although you may never plan to be involved in research, research may find its way into your life.
Certainly, if you decide to continue your education and work on a master's or doctoral degree, involvement in research will result from that decision. In addition, more and more workplaces are conducting internal research or are part of broader research studies. Thus, you may find yourself assigned to one of these studies.

The second reason to study statistics is to be able to read journals. Most technical journals you will read contain some form of statistics. Usually, you will find these statistics in something called the results section. Without an understanding of statistics, the information contained in this section will be meaningless. An understanding of basic statistics will provide you with the fundamental skills necessary to read and evaluate most results sections. The ability to extract meaning from journal articles and the ability to critically evaluate research from a statistical perspective are fundamental skills that will enhance your knowledge and understanding in related coursework.

The third reason is to further develop critical and analytic thinking skills. The study of statistics will serve to enhance and further develop these skills. To do well in statistics one must develop and use formal logical thinking abilities that are both high level and creative.

The fourth reason to study statistics is to be an informed consumer. Like any other tool, statistics can be used or misused. Yes, it is true that some individuals do actively lie and mislead with statistics. More often, however, well-meaning individuals unintentionally report erroneous statistical conclusions. If you know some of the basic statistical concepts, you will be in a better position to evaluate the information you have been given.

The fifth reason to have a working knowledge of statistics is to know when you need to hire a statistician. Conducting research is time consuming and expensive.
If you are in over your statistical head, it does not make sense to risk an entire project by attempting to compute the data analyses yourself. It is very easy to compute incomplete or inappropriate statistical analyses of one's data. It is also important to have enough statistical savvy to be able to discuss your project and the data analyses you want computed with the statistician you hire. In other words, you want to be able to make sure that your statistician is on the right track. (https://universalteacher.com/1/reasons-for-conducting-research/)

Statistics are part of our everyday life. Science fiction author H. G. Wells in 1903 stated, "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." Wells was quite prophetic, as the ability to think and reason about statistical information is not a luxury in today's information and technological age. Anyone who lacks fundamental statistical literacy, reasoning, and thinking skills may find they are unprepared to meet the needs of future employers or to navigate information presented in the news and media. On a most basic level, all one needs to do is open a newspaper, turn on the TV, examine the baseball box scores, or even just read a bank statement to see statistics in use on a daily basis.

Statistics in and of themselves are not anxiety producing. The idea of statistics is often anxiety provoking simply because it is a tool with which we are unfamiliar.

-------------------------------------------------------------------------------------------------------------------------------------------

STATISTICS

Defined:
   – A branch of science which deals with the collection, organization, presentation, analysis, and interpretation of data.
   – A body of techniques and procedures dealing with the collection, organization, analysis, interpretation, and presentation of information that can be stated numerically.
   – The backbone of (Quantitative) Research

Two Branches of Statistics

Descriptive statistics are used to organize or summarize a particular set of measurements. These deal with organizing and summarizing observations so that they are easier to comprehend. The census of households conducted by the Philippine Statistics Authority every five years represents an example of how descriptive statistics are generated. The information that is gathered concerning gender, race, income, etc. is compiled to describe the population of the Philippines at a given point in time. Collection, Organization, Presentation, and Analysis are part of descriptive statistics.

Inferential statistics use data gathered from a sample to make inferences or generate conclusions about the larger population from which the sample was drawn. Opinion polls and television ratings systems represent some uses of inferential statistics. For example, a limited number of people are polled during an election and then this information is used to describe voters as a whole. Interpretation falls under Inferential Statistics.

Example: Suppose we wanted to know the level of job satisfaction nurses experience working on various units within a particular hospital (e.g., psychiatric, cardiac care, obstetrics, etc.). The first thing we would need to do is collect some data. We might have all the nurses on a particular day complete a job satisfaction questionnaire. We could ask such questions as "On a scale of 1 (not satisfied) to 10 (highly satisfied), how satisfied are you with your job?". We might examine employee turnover rates for each unit during the past year. We also could examine absentee records for a two-month period of time, as decreased job satisfaction is correlated with higher absenteeism. Once we have collected the data, we would then organize it. In this case, we would organize it by nursing unit.
Absenteeism Data by Unit in Days

            Psychiatric   Cardiac Care   Obstetrics
                 3              8             4
                 6              9             4
                 4             10             3
                 7              8             5
                 5             10             4
   Mean =        5              9             4

Thus far, we have collected our data and we have organized it by hospital unit. You will also notice from the table above that we have performed a simple analysis. We found the mean (you probably know it by the name "average") absenteeism rate for each unit (descriptive statistics).

Next, we would interpret our data (inferential statistics). We could take the information gained from our nursing satisfaction study and make inferences to all hospital nurses. We might infer, and therefore conclude, that cardiac care nurses as a group are less satisfied with their jobs, as indicated by the high absenteeism rate.

This course will be discussed in light of the role of statistics in the research process, particularly the quantitative research process.

Statistics in the Research Process:

“Research is a procedure for carefully finding accurate solutions to important and relevant questions by the use of scientific method of gathering and interpreting information. Doing research is a multi-dimensional skill. Carrying out successful research must exceed the bounds of printed paper, and leap out to influence opinions and opinion shapers.” (https://universalteacher.com/1/reasons-for-conducting-research/)
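The unit means in the absenteeism table can be checked with a short script. This is only a minimal sketch in Python (the handout itself does not prescribe any software). The names `absences`, `unit_means`, and `highest` are illustrative, and the values are copied from the table.

```python
from statistics import mean

# Days absent per nurse, copied from the "Absenteeism Data by Unit in Days" table.
absences = {
    "Psychiatric": [3, 6, 4, 7, 5],
    "Cardiac Care": [8, 9, 10, 8, 10],
    "Obstetrics": [4, 4, 3, 5, 4],
}

# Descriptive step: summarize each unit by its mean number of days absent.
unit_means = {unit: mean(days) for unit, days in absences.items()}

for unit, unit_mean in unit_means.items():
    print(f"{unit}: mean = {unit_mean} days")

# Unit with the highest mean absenteeism.
highest = max(unit_means, key=unit_means.get)
print("Highest mean absenteeism:", highest)
```

The script only describes the nurses who were actually observed. Any statement about all hospital nurses is a separate, inferential step.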
The Research Process (from the standpoint of Statistics):

Formulate the research problem (this could be your general or specific objective)
   S – Specific
   M – Measurable
   A – Attainable
   R – Realistic
   T – Time-bound

   Remarks: A research objective that is SMART sets a very good road map for the conduct of research:
   – The scope/population is delineated, hence it can be determined beforehand whether a census (gathering data from the whole population) or a survey (gathering data from a sample) will be conducted
   – The subjects (sources of information) are identified, hence the appropriate method of data collection can be determined
   – The kind of information needed to answer the problem/objective is known at the beginning of the study
   – The type of objective is known, hence the appropriate descriptive and/or inferential statistical tools are anticipated

Define the population of the study
   o Population – all subjects under investigation; the set of all elements of interest in a particular study
   o Sample – a subset of the population

   Notes:
   a. In order to identify the population of the study, ask the question, “Who/What are going to provide the information needed to answer the research problem?”
   b. The population of the study need not consist of a human population.

Identify the variable/s of the study
   o Variable – measurable characteristic or attribute of the subject that is the focus of the study and that can take on different values

   Notes:
   a. In order to determine the variable/s of the study, ask the question, “What information is needed from each subject (element of the population) in order to answer the research problem?”
   b. A research problem or specific objective may involve one or more variables.
   c. It would be a good practice to determine the variable/s of each stated specific objective so as not to miss any information needed from each subject.

Example:
   Problem: What is the mean weekly household food expense of a USLS BStat student for the first semester of AY 2020 – 2021?

   Population of study: All USLS BStat students for the first semester, AY 2020 – 2021

   Question: How will the description and scope of the population be affected if each of the following is omitted?
      a. USLS
      b. BStat
      d. first semester
      e. AY 2020-2021

   Variable/s: weekly household food expense (only one piece of information is needed from each USLS BStat student for the first semester of AY 2020-2021)

   Remarks:
   1) Identifying the variable/s early in the study eliminates the possibility of
      a. missing it when eventually formulating the instrument, or
      b. including variables in the instrument that are not necessary in answering the problem/objectives
   2) Identifying the variable/s enables the researcher to determine the type of variable/s and the level/s of scale of the data that will be collected. These, in turn, determine the types of analysis and interpretation that will be applied to generate the needed results.

   :
   :

   (Anticipated) Conclusion (think ahead as to what the answer to the research problem/specific objective will look like): The mean weekly household food expense of a USLS BStat student for the first semester of AY 2020-2021 is _______.

   Notes:
   a. This will help you to anticipate that you need to compute the mean of the weekly household food expense values that you have collected from all members of the population.
   b. More importantly, your conclusion should be consistent with the statement of your problem/specific objectives: the problem is about the population under study, so the conclusion should also be about the population under study.
      – This is not a problem if a census is conducted: the conclusion is straightforward, as in the example above.
      – However, if only a sample was taken from the population for the study, the conclusion should never be about the sample; it should still be about the population. Hence, its form will be quite different from the anticipated conclusion in the example above (inferential statistics can provide a template for specific types of objectives).

CLASSIFICATION OF RESEARCH OBJECTIVES/GOALS:

Each stated objective can be differentiated according to the following classification. This will guide the researcher in anticipating the type of analysis and interpretation that the objective requires.

Analytic goals: directed toward finding out from the data one or more of the following attributes or characteristics of the group being studied:

1. Central tendency – general characteristic of the group
   Examples:
   a. To determine the mean weekly allowance of USLS College Freshmen for the first semester, AY 2020 – 2021.
   b. To determine the percentage of USLS College students who prefer a Samsung over a Vivo cellphone for the first semester, AY 2020-2021.

2. Variance in the group – how individual members of the group vary from the average characteristic of the group
   Examples:
   a. To determine the age range of the students in this class.
   b. To determine if the final grades in this class are similar.

3. Difference within the group/between groups – whether or not subgroups of the group, or two separate groups being studied, are different or similar on certain traits investigated (special case: comparison between/among two or more groups with regard to a particular variable)
   Examples:
   a. To compare the mean number of Coke Sakto bottles consumed in July 2020 between the male and female USLS students.
   b. To determine if there is a significant difference in the mean number of text messages sent in a day among the students from the five different colleges of USLS for the first semester, AY 2020-2021.

4. Relationships within the group – whether a relationship exists between certain variables covered in the study
   Examples:
   a. To establish if there is a significant relationship between choice of cellphone brand and the college a USLS student belongs to for the first semester, AY 2020-2021.
   b. To determine if relationship status and final grades in Statistics are independent for the first semester, AY 2020-2021.

5. Prediction – establishing a mathematical/statistical model to predict future outcomes
   Examples:
   a. What factors influence a graduate’s ability to land a job within one year after graduation?
   b. What are the estimated sales of a particular restaurant for next week if the present conditions hold?

Types of Analysis:
1. Descriptive – limited to the description of the particular group being studied; a conclusion cannot be applied to cases outside the study group
2. Inferential – application of the findings or conclusions from a small group to the larger group from which the smaller group was drawn

To summarize, the following diagram shows the aspects of statistics involved in a research process, depending on the scope of the study:

   Population study          Sample study
                             Sampling
   Collection                Collection
   Organization              Organization
   Presentation              Presentation
   Analysis                  Analysis
                             Interpretation
          Conclusion (always about the population)

AVOID either of two possible procedural errors:
1. You did a population study but you used inferential statistics to arrive at the conclusion.
2. You did a sample study but you did not use inferential statistics to arrive at the conclusion.
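The two procedural errors can also be written as a simple consistency check. The sketch below is illustrative only: the function names `stages` and `procedural_error` and the labels "population" and "sample" are hypothetical, chosen here to mirror the diagram, and are not part of the handout.

```python
# Stages shared by both kinds of study (the descriptive part of the process).
DESCRIPTIVE_STAGES = ["Collection", "Organization", "Presentation", "Analysis"]


def stages(study_type):
    """Return the ordered stages for a "population" or a "sample" study."""
    if study_type == "population":
        # Census: descriptive statistics lead straight to the conclusion.
        return DESCRIPTIVE_STAGES + ["Conclusion"]
    if study_type == "sample":
        # Sample study: sampling comes first, interpretation (inference) last.
        return ["Sampling"] + DESCRIPTIVE_STAGES + ["Interpretation", "Conclusion"]
    raise ValueError('study_type must be "population" or "sample"')


def procedural_error(study_type, used_inferential):
    """Return a message naming the procedural error, or None if there is none."""
    needs_inference = "Interpretation" in stages(study_type)
    if used_inferential and not needs_inference:
        return "Population study, but inferential statistics were used."
    if needs_inference and not used_inferential:
        return "Sample study, but inferential statistics were not used."
    return None


print(stages("sample"))
print(procedural_error("population", used_inferential=True))
print(procedural_error("sample", used_inferential=True))
```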
Remember, inferential statistics is applied only in order to generate conclusions about the population BASED ON SAMPLE DATA.

TYPES OF VARIABLES (inherent characteristic of the variable; does not change)

1. Qualitative/Categorical
   – Attributes are in terms of categories or levels: the descriptions that you give a variable that help to explain how variables should be measured, manipulated and/or controlled.

   Examples:
   1. Variable: Sex
      Categories: Male; Female
   2. Variable: Religion
      Categories: Roman Catholic; Protestant; Iglesia ni Cristo; Islam; Others, please specify _______
   3. Variable: Importance of university to getting a good job
      Levels: strongly agree; agree; neither agree nor disagree; disagree; strongly disagree

   Notes:
   1. Categories vs levels
      Categories – do not possess an intrinsic order; they are all considered equal
      Levels – possess an intrinsic or inherent order from one “category” to the next
   2. Categories/levels should be
      a. exhaustive – they should cover all possible answers (oftentimes, the use of “Others, please specify” serves the purpose of including all possibilities, especially those categories with small frequencies). This will prevent the respondent from being confused about what answer to tick (✓) or mark with an x when his or her desired response is not among the given options.
      b. mutually exclusive – the categories should not overlap, in order to ensure that the respondents provide only one answer. This will prevent the respondent from being confused as to which category to tick (✓) or mark with an x if there is more than one possible answer. This holds true even for multiple response questions.

2. Quantitative/Numerical
   – The variable has numerical properties, which are the values by which the variable can be measured, manipulated and/or controlled
   – Attributes are in terms of counts (discrete) or measurements (continuous)

   Distinctions/Types of quantitative variables:
   a. Discrete Variable
      – uses the process of counting to generate data
      – values of attributes are in terms of whole numbers only
      Examples:
      a. Number of t-shirts owned
      b. Number of pocketbooks read
   b. Continuous Variable
      – uses the process of measuring to generate data (with the use of a measuring instrument)
      – values of attributes may have fractional or decimal parts
      Examples:
      a. Weight of a package
      b. Volume of water
      c. Temperature

   Note: for continuous variables, it is important to append the unit of measurement since the result may have a different value depending on the unit.
   Example:
   – For discrete variables, the value of a number remains the same regardless of the variable: 5 chairs vs 5 students (the value of 5 is the same for both)
   – For continuous variables, the value of a number depends on the unit of measurement, even if the same variable is being measured: 5 inches vs 5 feet (a length measuring 5 inches is shorter than a length measuring 5 feet)

   READ: http://dissertation.laerd.com/types-of-variables.php

FUNCTIONS OF VARIABLES
   – Not an intrinsic property of the variable; it depends on the role of the variable in a study
   – Important if the investigation is about cause and effect

   Distinctions:
   a. Independent Variable
      – sometimes called an experimental or predictor variable
      – a variable that is being manipulated in an experiment in order to observe the effect this has on a dependent variable
      – what the researcher (or nature) manipulates: a treatment or program or cause
   b. Dependent Variable
      – sometimes called an outcome variable
      – a variable that is dependent on one or more independent variables
      – what is affected by the independent variable: the effects or outcomes

   Example:
   Study/Problem: the effects of a new educational program on student achievement
   Independent variable – the program
   Dependent variables – measures of achievement

   – A variable may function as an independent variable in one study and a dependent variable in another.

MEASUREMENT AND MEASUREMENT SCALES

What is Measurement?

Defn: Measurement – The process of assigning numbers to observations or observed characteristics

Normally, when one hears the term measurement, one may think in terms of measuring the length of something (e.g., the length of a piece of wood) or measuring a quantity of something (e.g., a cup of coffee). This represents a limited use of the term measurement. In statistics, the term measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to ways in which variables/numbers are defined and categorized. Each scale of measurement has certain properties which in turn determine the appropriateness for use of certain statistical analyses. The four scales of measurement are nominal, ordinal, interval, and ratio.

1. Nominal Scale
   – Consists of numbers which indicate categories for purely classification or identification purposes
   – The numbers serve as codes only; any number can be used to represent a category as long as no number is used for more than one category
   – The numbers do not indicate order among the categories
   – The numbers have no numeric properties, hence, the four fundamental operations (addition, subtraction, multiplication, division) cannot be applied to the numbers in the nominal scale
   – The categories are mutually exclusive (the observations cannot fall into more than one category)
   – The categories are exhaustive (there must be enough categories for all the observations)

   Example: Sex: Male = 1, Female = 2
   Remarks:
   a. Assigning the number 2 to Female does not imply that females are “better” than males.
   b. These numbers cannot be arithmetically manipulated, for example, to get the “average sex”.

2. Ordinal Scale
   – Possesses rank order characteristics
   – The categories must still be mutually exclusive and exhaustive, but they also indicate the order of magnitude of some variable
   – The numbers serve as codes but must now be assigned in consecutive order, indicating degree or level (for example: lowest to highest, most preferred to least preferred, etc.)

   Example: Likert item response:
      Strongly agree = 1
      Agree = 2
      Neither agree nor disagree = 3
      Disagree = 4
      Strongly disagree = 5
   Remarks:
   a. Although the numbers are arranged in consecutive order, it cannot be assumed that the differences between two consecutive numbers are the same anywhere in the scale. For example, the difference of “1” between strongly agree (1) and agree (2) is not necessarily the same as that between disagree (4) and strongly disagree (5).
   b. Fundamentally, these scales do not represent a measurable quantity; for this reason, arithmetic operations on the numbers are supposedly not applicable.

   Example: Likert-type items (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data. An individual may respond 8 to this question and be in less pain than someone else who responded 5. A person may not be in exactly half as much pain if they responded 4 than if they responded 8. All we know from this data is that an individual who responds 6 is in less pain than if they responded 8 and in more pain than if they responded 4. Therefore, Likert-type items only represent a rank ordering.

   REMEMBER:
   a. Nominal and Ordinal scale data are basically categories/levels converted to numeric codes.
   b. Qualitative variables generate either nominal (categories) or ordinal (levels) scale data.

3. Interval Scale
   – Has all the properties of the ordinal scale
   – A scale that represents quantity and has equal units
   – A given interval (distance) between scores has the same meaning anywhere on the scale
   – Provides information about how much better one value is compared with another
   – Zero does not represent the absolute lowest value; it is simply an additional point of measurement and not the absence of the property being measured

   Examples:
   a. Temperature measured on the Celsius scale
      – Temperature is defined as the measure of the warmth or coldness of an object or substance with reference to some standard value
      – Water boils at 100° Celsius and freezes at 0° Celsius (ice is cold to the touch)
      – However, 0° Celsius does not imply complete absence of heat. There are substances colder than ice (dry ice, liquid nitrogen), so 0° Celsius is not the absolute lowest value on the Celsius thermometer
   b. Score on a test
      – A test measures the knowledge gained by a student about the topic
      – A score of 0 does not imply complete absence of knowledge gained by a student about the topic

4. Ratio Scale
   – Possesses all the characteristics of the interval scale (represents quantity and has equality of units)
   – The most informative scale, as it tends to tell about the order and number of the object between the values of the scale
   – Allows comparison of intervals or differences
   – Has a true or absolute zero point (no numbers exist below zero, i.e., there are no negative numbers)
   – The ratio of two values is meaningful because the zero point characteristic makes it relevant or meaningful to say, “one object has twice the length of the other” or “is twice as long.”

   Examples:
   a. Very often, physical measures will represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and that measure cannot go below zero centimeters. A negative length is not possible.
   b. Cost of today’s lunch
   c. Length of time of a full-length movie

   REMEMBER:
   a. Interval and Ratio scale data possess inherently numeric characteristics.
   b. Quantitative variables generate either interval or ratio scale data.

The table below will help clarify the fundamental differences between the four scales of measurement:

              Indicates     Indicates Direction   Indicates Amount   Absolute
              Difference    of Difference         of Difference      Zero
   Nominal        X
   Ordinal        X                X
   Interval       X                X                     X
   Ratio          X                X                     X               X

You will notice in the above table that only the ratio scale meets the criteria for all four properties of scales of measurement.

-------------------------------------------------------------------------------------------------------------------------------------------

EXERCISES

1. Indicate whether each of the following examples refers to a population or to a sample.
   a. A group of 25 customers selected to taste a new soft drink
   b. Salaries of all CEOs in the pharmaceutical industry
   c. Customer satisfaction ratings of all clients of a local bank
   d. Monthly phone expenses of selected Globe subscribers

2. Indicate whether the following are qualitative (QL), quantitative discrete (QD) or quantitative continuous (QC) variables and the corresponding level of measurement of the data generated for each variable.
   a. Brand of jeans you prefer
   b. Ratio of current assets to current liabilities
   c. Number of text messages received per day
   d. Rating of the management skills of a company president
   e. Number of banks in the municipalities and cities of Negros Occidental
   f. Ranking of professional tennis players
   g. Scores of freshmen college students on an attitude towards math scale
   h. Effectiveness of a drug for headache, measured in minutes
   i. Earnings per share
   j. Number of leaves
   k. Weekly allowance
   l. Distance of the student’s house from school
   m. Color of the hair
   n. Zip code

3. Identify the level of measurement of the following variables.
   a. Age
   b. Place of birth
   c. Number of children in the family
   d. Grade in Math 1
   e. Height (in cm.)
   f. Favorite TV show
   g. Shoe size
   h. High school GPA
   i. Family monthly income
   j. Travel time (in minutes) from USLS to residence

4. A researcher measures two individuals and then uses the resulting scores to make a statement comparing the two individuals. For each of the following statements, identify the scale of measurement (nominal, ordinal, interval, ratio) that the researcher used.
   a. I can only say that the two individuals are different.
   b. I can say that one individual scored 6 points higher than the other.
   c. I can say that one individual scored higher than the other, but I cannot specify how much higher.
   d. I can say that the score for one individual is twice as large as the score for the other individual.

5. A firm is interested in testing the advertising effectiveness of a new television commercial.
As part of the test, the commercial is shown on a 6:30 PM local news program in Bacolod City. Two days later, a market research firm conducts a telephone survey to obtain information on recall rates (percentage of viewers who recall seeing the commercial) and impressions of the commercial.
   a. What is the population for this study? __________________________________________
      _________________________________________________________________________
   b. What is the sample for this study? _____________________________________________
      _________________________________________________________________________
   c. Why would a sample be used in this situation? Explain.

UNIVERSITY OF ST. LA SALLE
Yu An Log College of Business and Accountancy
BSTAT – BUSINESS STATISTICS
FIRST SEMESTER, AY 2020 – 2021

HANDOUTS 2
SAMPLING, DATA COLLECTION & ORGANIZATION

Defn: Sampling – the process of selecting the subjects of the population to be included in the sample

Why Sample?

Why should we not use the population as the focus of study? There are at least four major reasons to sample.

First, it is usually too costly to test the entire population.

The second reason to sample is that it may be impossible to test the entire population. For example, let us say that we wanted to test the 5-HIAA (a serotonergic metabolite) levels in the cerebrospinal fluid (CSF) of depressed individuals. There are far too many individuals who do not make it into the mental health system to even be identified as depressed, let alone to test their CSF.

The third reason to sample is that testing the entire population often produces error. Thus, sampling may be more accurate. Perhaps an example will help clarify this point. Say researchers wanted to examine the effectiveness of a new drug on Alzheimer's disease. One dependent variable that could be used is an Activities of Daily Living Checklist. In other words, it is a measure of functioning on a day to day basis.
In this experiment, it would make sense to have as few of people rating the patients as possible. If one individual rates the entire sample, there will be some measure of consistency from one patient to the next. If many raters are used, this introduces a source of error. These raters may all use a slightly different criteria for judging Activities of Daily Living. Thus, as in this example, it would be problematic to study an entire population. The final reason to sample is that testing may be destructive. It makes no sense to lesion the lateral hypothalamus of all rats to determine if it has an effect on food intake. We can get that information from operating on a small sample of rats. Also, you probably would not want to buy a car that had the door slammed five hundred thousand time or had been crash tested. Rather, you probably would want to purchase the car that did not make it into either of those samples. It is extremely important to choose a sample that is truly representative of the population so that the inferences derived from the sample can be generalized back to the population of interest. Improper and biased sampling is the primary reason for often divergent and erroneous inferences reported in opinion polls and exit polls conducted by different polling groups LEONARES, S. R. 1 Types of Sampling: A. Probability sampling each element of the population is given a chance (non-zero probability) of being included in the sample minimizes, if not eliminates, selection bias ideal if generalizability of results is important for your study inferential statistical procedures can be used for arriving at generalizations/conclusions about the population based on sample data 1. 
Simple Random
• Each element of the population is given an equal chance of being included in the sample
• Most basic probability sampling procedure
• Foundation of all probability sampling procedures
When to use:
– The population is homogeneous
– A sampling frame is available (sampling frame – complete and updated list of the population)
Procedure:
– Lottery
– Use of random number generators

2. Systematic Random
• Selecting every kth element of the population
When to use:
– When the population is homogeneous and there is no suspicion of a trend or pattern in the frame or geographical layout
– A sampling frame is available
Procedure:
i. Determine the sampling interval, k = N/n (rounded to the nearest integer)
ii. Identify the random start, rs: 1 ≤ rs ≤ k (randomly drawing a value between 1 and k)
iii. Determine the numbers of the elements to be included in the sample: rs, rs + k, rs + 2k, …
Example:
N (population size) = 10,100
n (sample size) = 150
k = N/n = 10,100/150 = 67.3 ≈ 67
rs = a randomly chosen number between 1 and 67
Suppose rs = 43 => #43 in the sampling frame becomes the first to be included in the sample
second = rs + k = 43 + 67 = #110
third = rs + 2k = 43 + 2(67) or 110 + 67 = #177, etc.
Continue getting the numbers until the sample size of 150 is reached.

3. Stratified Random
• Selecting random samples from mutually exclusive subpopulations, or strata, of the population
When to use:
– When the population is heterogeneous but can be subdivided into homogeneous subgroups or strata
– A sampling frame is available for each stratum
Procedure:
i. Determine the proportion of each stratum relative to the population
ii. Identify the stratum sample sizes using proportional allocation
iii. Select the samples from each stratum using either simple or systematic random sampling
Example: Among the 250 employees of the local office of an international insurance company, 182 are Filipinos, 51 are Chinese, and 17 are Americans.
If we use proportional allocation to select a stratified random grievance committee of 15 employees, how many employees must we take from each race?
Solution:
Race (i)    N_i    %       n_i
Filipino    182    72.8    15 x 72.8% = 10.92 ≈ 11
Chinese      51    20.4    15 x 20.4% = 3.06 ≈ 3
American     17     6.8    15 x 6.8% = 1.02 ≈ 1
Total       250   100      15
Therefore, 11 Filipinos, 3 Chinese, and 1 American will compose the grievance committee. These will have to be randomly selected from within each of the subgroups.

4. Cluster Random
• Selecting clusters of elements rather than individual elements
When to use:
– When "natural" groupings are evident in a statistical population
– A sampling frame is not available
Procedure:
i. Divide the population into clusters (M = total number of clusters)
ii. Randomly select m clusters
iii. Include all elements within the selected clusters to form the resulting sample

5. Multi-stage random sampling
• Repeated cluster sampling

B. Non-probability sampling
• not all elements of the population are given a chance of being included in the sample
• prone to selection bias
• however, there may be unique circumstances where non-probability sampling can also be justified (e.g., medical research), although the generalizability of the conclusion is not assured
• inferential statistical procedures cannot be used for arriving at generalizations/conclusions about the population based on sample data

1. Convenience / Voluntary / Haphazard / Accidental
• Sample elements are selected because they are available
2. Judgmental / Purposive / Expert
• The researcher selects the sample based on his or her judgment as to who best fits the established criteria
3. Quota
• Selecting sample elements nonrandomly according to some fixed quota
4. Snowball
• Especially useful when you are trying to reach populations that are inaccessible or hard to find

Problems in Sampling:
There are several potential sampling problems. When designing a study, a sampling procedure is also developed including the potential sampling frame.
Several problems may exist within the sampling frame.
1. Missing elements - individuals who should be on your list but for some reason are not on the list. For example, if my population consists of all individuals living in a particular city and I use the phone directory as my sampling frame or list, I will miss individuals who have unlisted numbers or who cannot afford a phone.
2. Foreign elements - elements which should not be included in my population and sample but appear on my sampling list. Thus, if I were to use property records to create my list of individuals living within a particular city, landlords who live elsewhere would be foreign elements. In this case, renters would be missing elements.
3. Duplicates - elements that appear more than once on the sampling frame. For example, if I am a researcher studying patient satisfaction with emergency room care, I may potentially include the same patient more than once in my study. If the patients are completing a patient satisfaction questionnaire, I need to make sure that patients are aware that if they have completed the questionnaire previously, they should not complete it again. If they complete it more than once, their second set of data represents a duplicate.

Read: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapter 8 – Sampling)

DATA COLLECTION PROCEDURES

1. Interview
• There is interaction between interviewer and respondent
• Most important method of data collection
Some advantages:
o Clarifications about ambiguous questions/answers can be made
o More in-depth information can be generated
Some disadvantages:
o Time-consuming
o Costly
o Responses may be influenced by the interviewer

2.
Questionnaire
• No interaction between facilitator and respondent about the subject matter
• Respondent personally answers the questions on survey forms
Some advantages:
o Less costly
o Less time-consuming
o Responses are not influenced by the interviewer
o Respondents answer the questions with relative anonymity; may answer more truthfully
Some disadvantages:
o Not effective if the respondent is illiterate
o Clarifications about vague questions cannot be made
o Respondents may misinterpret the questions
o Intended respondents may not personally answer the forms; may request other people to respond
o Low rate of returns

3. Experimentation
• A controlled study in which the researcher attempts to understand cause-and-effect relationships
• The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives.

4. Observation
• Like experiments, observational studies attempt to understand cause-and-effect relationships
• Unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.
• Also used for behavioral and attitudinal studies

Read: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapters 9 & 10)

ORGANIZATION AND PRESENTATION OF DATA

What is data organization?
- a systematic arrangement and summary of raw data so that it is easier to analyze and study

ORGANIZING AND SUMMARIZING QUALITATIVE DATA

Frequency Distribution - A tabular summary of data showing the number (frequency) of items in each of several non-overlapping classes.

Example: The following data were obtained from a sample of 50 soft drink purchases. Construct a frequency distribution to summarize the data.
Variable: soft drink purchased
Coke, Coke Zero, Pepsi Max, Pepsi, Pepsi, Coke Zero, Coke Zero, Sprite, Coke, Coke,
Pepsi, Coke Zero, Pepsi Max, Coke Zero, Pepsi Max, Pepsi Max, Sprite, Sprite, Coke Zero, Pepsi Max,
Pepsi Max, Coke, Coke, Pepsi, Coke, Sprite, Coke, Coke, Mountain Dew, Mountain Dew,
Mountain Dew, Coke, Pepsi Max, Coke, Pepsi, Mountain Dew, Pepsi, Pepsi Max, Mountain Dew, Pepsi Max,
Coke, Pepsi, Coke, Pepsi Max, Sprite, Coke, Pepsi, Coke, Sprite, Mountain Dew

Salient points of a frequency distribution table:
a. appropriate label headings
b. categories of the variable being organized should be non-overlapping
   example: variable: soft drink brand
   categories: Coke, Coke Zero, Pepsi, Pepsi Max, Sprite, Mountain Dew
c. frequency – number of times the category appeared in the data set
d. percent – (frequency of the category ÷ total) x 100%

Table 1. Distribution of Soft Drink Purchases
Soft Drink Brand    Frequency    Percent
Coke                   14           28
Coke Zero               6           12
Pepsi                   8           16
Pepsi Max              10           20
Sprite                  6           12
Mountain Dew            6           12
Total                  50          100
Note: Follow the APA format in presenting data using a table.

Graphical presentations of qualitative data:

1. Bar graph – A graphical device for depicting qualitative data that have been summarized in a frequency or percent distribution
[Bar graph; vertical axis: No. of bottles bought, horizontal axis: Soft drink brand]
Fig. 1.1. Soft drink purchases of buyers

2. Pie chart – A graphical device for presenting data summaries based on subdivision of a circle into sectors that correspond to the percentage frequency for each category
[Pie chart: Coke 28%, Coke Zero 12%, Pepsi 16%, Pepsi Max 20%, Sprite 12%, Mountain Dew 12%]
Fig. 1.2. Percentage distribution of soft drink purchases

USING EXCEL: Watch Excel Statistics 15: Category Frequency Distribution w Pivot Table & Pie Chart by ExcelIsFun at http://www.youtube.com/watch?v=-ERARVSfeuw

3. Rod Graph – a form of bar graph where the bars have zero width. It is especially used when the data are discrete.
Example.
Scores of 12 psychiatric patients on a 5-point anxiety scale:
Patient   1  2  3  4  5  6  7  8  9  10  11  12
Score     4  3  5  1  4  4  2  5  4   3   4   5
Array: 1, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5
Distinct score values: 1, 2, 3, 4, 5 (Ordinal data)

Table 1.3. Frequency distribution of anxiety scores
Score    Frequency    Percentage
1            1            8.3
2            1            8.3
3            2           16.7
4            5           41.7
5            3           25.0
Total       12          100.0

Rod Graph:
[Rod graph; vertical axis: Frequency, horizontal axis: Score]
Fig. 1.3. Anxiety scores of psychiatric patients

SHAPES OF DISTRIBUTIONS:
The rod graph (and later the histogram or frequency polygon) provides information about the shape of the distribution, that is, how the collected data are distributed over the possible values of the variable. There are three major types:
1. Symmetric – the shape of the left side of the distribution is a mirror image of the right side
2. Skewed – the two sides of the distribution are not mirror images of each other
   a. Positively skewed (skewed to the right, right-skewed) – scores tend to cluster toward the lower end of the scale (i.e., the smaller numbers) with increasingly fewer scores at the upper end of the scale (the larger numbers)
   b. Negatively skewed (skewed to the left, left-skewed) – most of the scores tend to occur toward the upper end of the scale while increasingly fewer scores occur toward the lower end
Note: the height of the graph represents the corresponding frequency of the point on the horizontal axis

Example:
[Rod graph of the anxiety scores; vertical axis: Frequency, horizontal axis: Score]
• negatively skewed
• longer left tail than right tail
• more scores to the right of the center (score = 3) than to the left
NOTE: more on the shape will be discussed in relation to measures of central tendency (later topic)
READ: https://www.mathbootcamps.com/common-shapes-of-distributions/

ORGANIZING AND SUMMARIZING QUANTITATIVE DATA
> These procedures can be used for either continuous or discrete data

Frequency Distribution for Quantitative Data
Characteristics:
1.
Non-overlapping class intervals (also called classes or intervals).
   • use between 5 and 20 classes.
   • use enough classes to show the variation in the data, but not so many that some contain only a few items.
2. Each class has a lower limit (the lowest possible value that can belong to it) and an upper limit (the highest possible value that can belong to it).
   Example: 11 – 15: the class interval contains values from 11 to 15 (includes 11, 12, 13, 14, 15)
3. Uniform class width for all classes (also called interval size). This may be identified by the difference between two successive lower limits or two successive upper limits.
Can be generated by applications like Excel or statistical software.

Example: The following data correspond to the age of the eldest child of parents in a given class:
12 14 19 18 16 30 15 15 18 17
21 31 20 27 22 23 15 25 22 21
33 28 14 22 14 18 16 13 27 18

Table 4. Frequency Distribution of Ages
Age (years)    Frequency
12 – 15            8
16 – 19            8
20 – 23            7
24 – 27            3
28 – 31            3
32 – 35            1
Total             30

Comments:
1. There are 6 class intervals.
2. The class width is 4 (difference between 2 successive lower limits, e.g., 16 − 12, 32 − 28; or difference between 2 successive upper limits, e.g., 31 − 27, 23 − 19).

> For purposes of presenting the data using a graph, additional columns are needed:
1. Class boundaries remove the gaps between intervals (there is a gap of 1 between 12 – 15 and 16 – 19, etc.). This is especially necessary if your data are continuous. With boundaries there is no more gap between the first and second intervals: 11.5 – 15.5 and 15.5 – 19.5, etc.
2. Class marks are the midpoints of the class intervals (add the lower limit and upper limit, then divide by 2)
   Example: for the first interval: (12 + 15)/2 = 13.5 (do not round off)
3.
Percentage = (frequency/total frequency) x 100%
   first interval: (8/30) x 100% = 26.7

Example: Using the age data (Table 4), the table is expanded below:
Age (years)    Class Boundaries    Class Marks    Frequency    Percentage
12 – 15          11.5 – 15.5          13.5            8           26.7
16 – 19          15.5 – 19.5          17.5            8           26.7
20 – 23          19.5 – 23.5          21.5            7           23.3
24 – 27          23.5 – 27.5          25.5            3           10.0
28 – 31          27.5 – 31.5          29.5            3           10.0
32 – 35          31.5 – 35.5          33.5            1            3.3
Total                                                30          100.0

Graphical Representations of Quantitative Frequency Distributions:

1. Histogram – A graph consisting of a series of vertical columns or rectangles with no gaps between bars
• each bar is drawn with a base equal to the class boundaries and a height corresponding to the class frequency
• a suitable graph for representing data obtained from continuous variables.
[Histogram; vertical axis: Frequency, horizontal axis: Age (years), class boundaries 11.5 to 35.5]
Fig. 1.4 Age distribution of the eldest children
Comment: Consider the boundary line between the bars of the third and fourth intervals to be the middle value (dividing line between the left and right sides). The shape of the distribution of ages is positively skewed.

2. Frequency Polygon – Constructed by plotting class marks (X) against class frequencies (Y) and connecting the consecutive points by straight lines
• to close the frequency polygon, additional class marks (9.5 and 37.5) are added to both ends of the distribution, each with zero frequency
[Frequency polygon; vertical axis: Frequency, horizontal axis: Age (years), class marks 9.5 to 37.5]
Comment: The shape is positively skewed – more values are concentrated to the left of the blue line than to the right.

USING EXCEL: Watch the following videos:
A. Excel Campus – Jon
1. Introduction to Pivot Tables, Charts, and Dashboards in Excel (Part 1) https://www.youtube.com/watch?v=9NUjHBNWe9M
2. Introduction to Pivot Tables, Charts, and Dashboards in Excel (Part 2) https://www.youtube.com/watch?v=g530cnFfk8Y
3.
How to Create a Dashboard Using Pivot Tables and Charts in Excel (Part 3) https://www.youtube.com/watch?v=FyggutiBKvU
B. by DannyRocksExcels:
1. Two Ways to Create a Frequency Distribution Report in Excel, http://www.youtube.com/watch?v=nh5ObAKfj1o&feature=fvsr (preference: use of pivot functions)
2. Use an Excel Table to Group Data by Age Bracket, http://www.youtube.com/watch?v=GZvJniF6IPY

EXERCISES (using the Excel application)

1. Mari’s Steakhouse uses a questionnaire to ask customers how they rate the server, food quality, cocktails, prices, and atmosphere at the restaurant. Each characteristic is rated on a scale of outstanding (O), very good (V), good (G), average (A), and poor (P). Construct a frequency distribution, bar graph, and pie chart to summarize the following data collected on food quality. What is your feeling about the food quality ratings at the restaurant?
G O V G A O V O V G
O V A V O P V O G A
O O O G O V V A G O
V P V O O G O O V O
G A O V O O G V A G

2. The following are the final examination test scores of 50 statistics students.
68 45 38 52 54 43 69 44 52 64
55 56 50 54 38 40 54 55 51 55
65 59 37 57 46 29 64 58 53 37
42 56 42 49 49 43 41 55 49 47
64 42 53 63 33 60 63 41 48 50
a. Construct a frequency distribution using 7 classes.
b. Develop a histogram and a frequency polygon.
c. Determine the shape of the distribution.

3. The following data are the scores of 50 individuals who answered a 150-item aptitude test as a requirement for a job application.
112 107 97 69 72 100 115 106 73 73
86 76 92 119 98 126 124 127 118 128
106 84 82 83 134 132 104 94 75 92
92 100 96 108 85 98 115 81 102 91
76 68 113 95 106 80 81 141 95 119
a. Construct a frequency distribution for this data set using 8 classes.
b. Construct a histogram and a frequency polygon.
c. Determine the shape of the distribution.

UNIVERSITY OF ST.
LA SALLE
Yu An Log College of Business and Accountancy
BSTAT – BUSINESS STATISTICS
First Semester, AY 2020 – 2021

HANDOUTS 3
MEASURES OF CENTRAL TENDENCY & VARIABILITY

Recall: Statistics involves a body of techniques and procedures dealing with the collection, organization, analysis, interpretation, and presentation of information that can be stated numerically.

Summarizing data involves using statistical tools and procedures appropriate for answering a research problem or objective. The following terms need to be differentiated:
Measure – a numerical representation of a particular characteristic (variable of the study) of the group being studied
Parameter – a measure calculated from the population; usually represented by letters of the Greek alphabet
Statistic – a measure calculated from the sample; usually represented by letters of the English alphabet

Summaries of QUALITATIVE DATA:
Qualitative data are summarized using the following measures:
• proportions (also called relative frequencies)
• percentages
For example: the variable sex is coded as M – 0, F – 1.
Remark: Since “sex” is a qualitative variable and the codes 0 and 1 represent nominal data, it is not appropriate to treat them as numbers with values. Applying arithmetic operations such as addition and division to get the “average sex” makes no sense for a qualitative variable. Rather, use the proportion (or percentage) of males (or females) in the group. Say, “Two out of 10 students are male,” or “Twenty percent of the students are male.”

Summaries of QUANTITATIVE DATA:
Quantitative data are usually summarized in terms of the center and spread of the distribution. The center of the distribution can be identified using an appropriate measure of central tendency or location.
MEASURES OF CENTRAL TENDENCY OR LOCATION (AVERAGES)
A measure of central tendency or location is
• a representative value of the data set
• the value around which most of the data points are found

(ARITHMETIC) MEAN
• computed by summing all the data values in the sample or population and dividing the sum by the number of observations (usually referred to as the “average”)
• Most important measure representing the center of the distribution if the distribution is symmetric
• Data must be at least interval
• Most stable measure of location, especially for large data sets
• When n is small, the mean is very sensitive to extreme values

Differentiate between the population and sample means by their symbols:
Population Mean: $\mu = \frac{\sum x_i}{N}$, where $x_i$ is the ith score or observation, and N is the number of observations in the population (the parameter is $\mu$, the Greek letter “mu”)
Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is the ith score or observation, and n is the number of observations in the sample (the statistic is $\bar{x}$, read as “x-bar”)

Why differentiate between $\mu$ and $\bar{x}$? If the research procedure is a population study, then a population symbol (parameter) must be used; if it is a sample study, then a sample symbol (statistic) must be used. This will be a very important distinction in inferential statistics. That is why it is important to determine at the beginning of the research process whether you will be doing a population or a sample study, since it will have a bearing on the notations/symbols used for parameters or statistics.

Example 1: During a particular summer month, the eight salespeople in an appliance store sold the following number of central air-conditioning units: 8, 11, 5, 14, 8, 11, 16, 11. Considering this month as the statistical population of interest, the mean number of units sold is
$\mu = \frac{\sum x_i}{N} = \frac{84}{8} = 10.5$ central a/c units
Why $\mu$? Because the problem stated that the month should be considered as a statistical population of interest.
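The arithmetic in Example 1 can be checked with a few lines of Python. This is only a minimal sketch: the variable names (`units_sold`, `population_mean`) are illustrative and not part of the handout, and the data are treated as a population because the example says so.

```python
# Units sold by the eight salespeople in Example 1 (treated as a population).
units_sold = [8, 11, 5, 14, 8, 11, 16, 11]

# Population mean: sum of all observations divided by N.
N = len(units_sold)
total = sum(units_sold)
population_mean = total / N

print(f"Sum = {total}, N = {N}, mean = {population_mean}")
```

For a sample, the arithmetic is identical; only the symbol used to report the result changes.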
WEIGHTED MEAN
• Also called the weighted average
• an arithmetic mean in which each value is weighted according to its importance in the overall group
• formulas for the population and sample weighted means are identical:
$\mu_w$ or $\bar{X}_w = \frac{\sum wX}{\sum w}$
• Operationally, each value in the group (X) is multiplied by the appropriate weight factor (w), and the products are then summed and divided by the sum of the weights.

Example 2: In a multiproduct company, the profit margins for the company’s four product lines during the past fiscal year were: line A, 4.2 percent; line B, 5.5 percent; line C, 7.4 percent; and line D, 10.1 percent. The unweighted mean profit margin is
$\mu = \frac{\sum x}{N} = \frac{27.2}{4} = 6.80\%$
However, unless the four products are equal in sales, this unweighted average is incorrect. Assuming the sales totals in the following table, which are not all equal, the weighted mean correctly describes the overall average.

Product Line    Profit Margin, X (%)    Sales, in Php (w)    wX
A                      4.2                 30,000,000       126,000,000
B                      5.5                 20,000,000       110,000,000
C                      7.4                  5,000,000        37,000,000
D                     10.1                  3,000,000        30,300,000
Total                                   Php58,000,000    Php303,300,000

Hence, the weighted mean profit margin is
$\mu_w = \frac{303{,}300{,}000}{58{,}000{,}000} \approx 5.23\%$

Remark: The weighted mean is used in computing final grades when the numbers of units of the subjects are not equal. Each grade is multiplied by the number of units of the subject, and the sum of the (grade x no. of units) products is divided by the total number of units taken.
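The weighted-mean computation in Example 2 can be sketched in Python as follows. This is an illustrative sketch only: the list names (`margins`, `sales`) and result names are assumptions made for the example, with the values taken from the table above.

```python
# Profit margins (%) and sales (Php) for product lines A to D in Example 2.
margins = [4.2, 5.5, 7.4, 10.1]
sales = [30_000_000, 20_000_000, 5_000_000, 3_000_000]

# Unweighted mean: every product line counts equally.
unweighted_mean = sum(margins) / len(margins)

# Weighted mean: multiply each margin by its weight (sales),
# sum the products, then divide by the sum of the weights.
weighted_sum = sum(w * x for w, x in zip(sales, margins))
weighted_mean = weighted_sum / sum(sales)

print(f"Unweighted mean: {unweighted_mean:.2f}%")
print(f"Weighted mean:   {weighted_mean:.2f}%")
```

Because lines A and B carry most of the sales and have the lowest margins, the weighted mean lands well below the unweighted 6.80%.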
MEDIAN
• Center of an array (arrangement of the data from lowest to highest)
• Divides the array into two equal parts
• Useful for summarizing skewed distributions because it is not sensitive to extreme values
• Equal to the mean for symmetric distributions
• Data must be at least ordinal
• If N (or n) is odd, the median is the middle number of the array
• If N (or n) is even, the median is the mean of the two middle values
Population Median: $\tilde{\mu}$ (read “mu-tilde”)
Sample Median: $\tilde{x}$ (read “x-tilde”)

Example 3: The eight salespeople described in Example 1 sold the following number of central air-conditioning units, in ascending order: 5, 8, 8, 11, 11, 11, 14, 16. Find the median.
Array: 5, 8, 8, 11, 11, 11, 14, 16
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units
Since the number of data values is even (N = 8), the value of the median is the mean of the two middle values, which are the fourth and fifth values in the ordered group. Both these values equal “11” in this case, so adding the two 11’s and dividing by 2 gives the median, which is equal to 11. Note that there is an equal number of data points below and above the median (5, 8, 8, 11 are below; 11, 11, 14, 16 are above).

Example 4: The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the median.
First form the array: 2.3, 2.5, 2.6, 2.9, 3.1, 3.4, 3.6, 4.1, 4.3
Since there are 9 data values (odd), there will only be one middle value, the fifth value in the array.
$\tilde{x} = 3.1$ seconds
Why $\tilde{x}$? Because the problem specifically identifies the group as a random sample.
NOTE: When the problem does not specifically indicate whether the group involved is a sample or population, treat the data set as a sample.

Recall Example 1: During a particular summer month, the eight salespeople in an appliance store sold the following number of central air-conditioning units: 8, 11, 5, 14, 8, 11, 16, 11.
Considering this month as the statistical population of interest,
a. the mean number of units sold is
$\mu = \frac{\sum x_i}{N} = \frac{84}{8} = 10.5$ central a/c units
b. the median value from Example 3 is
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units

Dot plot:
[Dot plot of the units sold; horizontal axis from 5 to 16]
The mean and median are relatively close to each other. The mean and the median values would be considered to be good representatives of the data set since they are located in the center of the distribution (where the points are).

What if, instead of 16, the highest value is 160? Then the last point of the dot plot would be very far from the rest of the points (extremely high value); it can also be called an outlier.
Solution with the outlier, 160:
Array: 5, 8, 8, 11, 11, 11, 14, 160
Then:
$\mu = \frac{\sum x_i}{N} = \frac{228}{8} = 28.5$ central a/c units
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units
The resulting value of the mean is not found at the center of where the points are (28.5 is far from the majority of the points), while the median remains the same.

The value of the mean is affected if there are extreme values in the distribution; hence, it cannot be used to represent the distribution if the shape is skewed. That is why one condition for its use as a representative value is that the shape must be symmetric. On the other hand, the median has not changed, because only the middle value (if n is odd) or the mean of the two middle values (if n is even) is used; the extreme value is not used in determining the median. Therefore, the median is a better representative value if the shape of the distribution is skewed.
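The effect of the outlier can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the names `original`, `with_outlier`, and `summary` are illustrative and not part of the handout.

```python
from statistics import mean, median

# Units sold in Example 1, and the same data with 16 replaced by 160.
original = [5, 8, 8, 11, 11, 11, 14, 16]
with_outlier = [5, 8, 8, 11, 11, 11, 14, 160]

# Collect (mean, median) for each data set.
summary = {
    "original": (mean(original), median(original)),
    "with outlier": (mean(with_outlier), median(with_outlier)),
}

for label, (avg, mid) in summary.items():
    print(f"{label}: mean = {avg}, median = {mid}")
```

Only the mean moves when the extreme value is introduced; the median depends solely on the two middle values of the array.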
MODE
• Value in the data set which has the highest frequency (occurs most often)
• Can be applied to any measurement level
• May not exist (the data set may not have a mode if all the values occur with the same frequency)
• May not be unique, if it exists (a data set may have more than one value with the same highest frequency)
• Related to the concept of a peak or peaks in the frequency distribution
   Unimodal – one peak
   Bimodal – two peaks, etc.
Population Mode: Mo
Sample Mode: mo

Example 5: The eight salespeople described in Example 1 sold the following number of central air-conditioning units: 8, 11, 5, 14, 8, 11, 16, and 11. Find the mode.
Mo = 11 central air-conditioning units

Example 6: The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Find the mode.
Since all values occur only once (they have the same frequency), this distribution has no mode, or we say that the mode does not exist. This is different from saying that the mode is 0 (why?)

RELATIONSHIP BETWEEN THE MEAN AND THE MEDIAN:
Note that the shape of the distribution is important in choosing the most appropriate measure of central tendency (and in other measures and tests as well). Hence, when there is no graph to base it on, the shape can be determined by comparing the mean and median values:
a. symmetric distribution: mean = median
b. positively skewed distribution: mean > median
c. negatively skewed distribution: mean < median

[...]

if SK > 0 => positively skewed
if SK < 0 => negatively skewed
if SK = 0 => symmetric
Rule of thumb (Bulmer, 1979): If SK is
• less than −1 or greater than +1, the distribution is highly skewed.
• between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
• between −½ and +½, the distribution is approximately symmetric.

D.
EMPIRICAL RULE
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean, that is,
o Approximately 68% of the data values will be within 1 standard deviation of the mean: $(\mu \pm 1\sigma) = (\mu - 1\sigma,\ \mu + 1\sigma)$.
o Approximately 95% of the data values will be within 2 standard deviations of the mean: $(\mu \pm 2\sigma) = (\mu - 2\sigma,\ \mu + 2\sigma)$.
o Approximately 99.7% of the data values will be within 3 standard deviations of the mean: $(\mu \pm 3\sigma) = (\mu - 3\sigma,\ \mu + 3\sigma)$.

Remarks on the bell-shaped curve (also called the normal curve):
1. the horizontal line can go much lower than $\mu - 4\sigma$ and much higher than $\mu + 4\sigma$.
2. the total area under the curve and above the horizontal line is 1 or 100%
3. since it is symmetric, the percentages between similarly distanced points on the x-axis from the mean are equal (see above figure)
4. 0.15% (on the left of the figure) is the area from $\mu - 3\sigma$ and below; 0.15% (on the right of the figure) is from $\mu + 3\sigma$ and above.

Example: Liquid detergent cartons are filled automatically on a production line. Filling weights frequently have a bell-shaped distribution. If the mean filling weight is 16.00 ounces and the standard deviation is 0.25 ounces, use the empirical rule to draw conclusions about the distribution of filling weights.
$\mu$ = 16.00 oz ; $\sigma$ = 0.25 oz
$\mu \pm 1\sigma$: 16.00 ± 0.25 = (16.00 − 0.25, 16.00 + 0.25) = (15.75, 16.25)
   Approximately 68% of the liquid detergent cartons have filling weights between 15.75 oz and 16.25 oz
$\mu \pm 2\sigma$: 16.00 ± (2)0.25 = 16.00 ± 0.50 = (15.50, 16.50)
   Approximately 95% of the liquid detergent cartons have filling weights between 15.50 oz and 16.50 oz
$\mu \pm 3\sigma$: 16.00 ± (3)0.25 = 16.00 ± 0.75 = (15.25, 16.75)
   Approximately 99.7% of the liquid detergent cartons have filling weights between 15.25 oz and 16.75 oz

EXERCISES
1. A goal of management is to help their company earn as much as possible relative to the capital invested.
One measure of success is return on equity – the ratio of net income to stockholders’ equity. Shown here are return on equity percentages for 25 companies. Find the range, variance, and standard deviation.
9.0 19.6 22.9 41.6 11.4 15.8 52.7 17.3 12.3 5.1 17.3 31.1 9.6 8.6 11.2 12.8 12.2 14.5 9.2 16.6 5.0 30.3 14.7 19.2 6.2

2. During a 30-day period, the daily numbers of cars rented by a car rental company are as follows:
7 10 6 7 9 4 7 9 9 8 5 5 7 8 4 6 9 7 12 7 9 10 4 7 5 9 8 9 5 7
Find the range, variance, and standard deviation.

3. A manufacturing firm regularly places orders with two different suppliers, A and B. The following data are the number of days required to fill orders for these suppliers.
Supplier A: 11 10 9 10 11 11 10 11 10 10
Supplier B: 8 10 13 7 10 11 10 7 15 12
Determine which supplier provides the more consistent and reliable delivery times. Use the range and standard deviation. Since you are comparing the two, why just use the standard deviation and not compute the coefficient of variation?

4. A production department uses a sampling procedure to test the quality of newly produced items. The department employs the following decision rule at an inspection station: if a sample of 14 items has a variance of more than 0.005, the production line must be shut down for repairs. Suppose the following data have been collected:
3.43 3.45 3.43 3.48 3.52 3.50 3.39 3.48 3.41 3.38 3.49 3.45 3.51 3.50
Should the production line be shut down? Why or why not?

5. Two friends want to take a summer holiday before going to college in the autumn. They are looking for somewhere with plenty of clubs where they can party all night. Unfortunately, they have left it rather late to book and there are only two resorts, Medlena and Bistry, available within their budget.
When they ask about the ages of the holiday-makers at these resorts, their travel agent says the only thing he can tell them is that the mean age of people going to Medlena is 19 whereas the mean age of visitors to Bistry is 22. Just as they are about to book holidays in Medlena, because it seems to attract the sort of young crowd they want to be with, the travel agent says, ‘I’ve got some more figures: the standard deviation of the ages of visitors to Medlena is 8 and the standard deviation of the ages of visitors to Bistry is 2.’ Should they change their minds on the basis of this new information, and if so, why?

6. Many national academic achievement and aptitude tests, such as the SAT, report standardized test scores with the mean for the normative group used to establish scoring standards converted to 500 with a standard deviation of 100. Suppose that the distribution of scores for such a test is known to be approximately normally distributed. Determine the approximate percentage of reported scores that would be
a. between 400 and 600
b. between 500 and 700
c. greater than 700
d. less than 200
Hint: Draw the bell-shaped curve and replace the values of $\mu$ and $\sigma$ on the horizontal axis.

7. An SAT test taker (refer to #6) got a score of 625. What is his standard score?

8. The same student (in #7) got the same score (625) in a different test, the mean of which is 450 and standard deviation 150. In which test did this student fare better?
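For the items that rely on the empirical rule, the interval arithmetic can be sketched in Python. The sketch reuses the liquid detergent figures from this handout (mean 16.00 oz, standard deviation 0.25 oz) rather than any exercise data; the names `mu`, `sigma`, `coverage`, and `intervals` are illustrative assumptions.

```python
# Liquid detergent example: mean and standard deviation of filling weights (oz).
mu = 16.00
sigma = 0.25

# Approximate share of values within k standard deviations (empirical rule).
coverage = {1: 68.0, 2: 95.0, 3: 99.7}

# Interval (mu - k*sigma, mu + k*sigma) for k = 1, 2, 3.
intervals = {k: (mu - k * sigma, mu + k * sigma) for k in coverage}

for k, (low, high) in intervals.items():
    print(f"within {k} SD: ({low:.2f}, {high:.2f}) oz, about {coverage[k]}% of cartons")
```

Substituting another mean and standard deviation gives the tick marks to place on the horizontal axis of the bell-shaped curve.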