BSTAT Business Statistics Handouts 2 PDF

UNIVERSITY OF ST. LA SALLE
Yu An Log College of Business and Accountancy
BSTAT – BUSINESS STATISTICS
FIRST SEMESTER, AY 2020 – 2021

HANDOUTS 2
SAMPLING, DATA COLLECTION & ORGANIZATION

Defn: Sampling – the process of selecting the subjects of the population to be included in the sample

Why Sample?

Why should we not use the population as the focus of study? There are at least four major reasons to sample.

First, it is usually too costly to test the entire population.

The second reason to sample is that it may be impossible to test the entire population. For example, let us say that we wanted to test the 5-HIAA (a serotonergic metabolite) levels in the cerebrospinal fluid (CSF) of depressed individuals. There are far too many individuals who do not make it into the mental health system to even be identified as depressed, let alone to test their CSF.

The third reason to sample is that testing the entire population often produces error. Thus, sampling may be more accurate. Perhaps an example will help clarify this point. Say researchers wanted to examine the effectiveness of a new drug on Alzheimer's disease. One dependent variable that could be used is an Activities of Daily Living Checklist. In other words, it is a measure of functioning on a day-to-day basis. In this experiment, it would make sense to have as few people rating the patients as possible. If one individual rates the entire sample, there will be some measure of consistency from one patient to the next. If many raters are used, this introduces a source of error. These raters may all use slightly different criteria for judging Activities of Daily Living. Thus, as in this example, it would be problematic to study an entire population.

The final reason to sample is that testing may be destructive. It makes no sense to lesion the lateral hypothalamus of all rats to determine if it has an effect on food intake. We can get that information from operating on a small sample of rats.
Also, you probably would not want to buy a car that had its door slammed five hundred thousand times or that had been crash tested. Rather, you would probably want to purchase a car that did not make it into either of those samples.

It is extremely important to choose a sample that is truly representative of the population so that the inferences derived from the sample can be generalized back to the population of interest. Improper and biased sampling is the primary reason for the often divergent and erroneous inferences reported in opinion polls and exit polls conducted by different polling groups.

LEONARES, S. R.

Types of Sampling:

A. Probability sampling
• each element of the population is given a chance (non-zero probability) of being included in the sample
• minimizes, if not eliminates, selection bias
• ideal if generalizability of results is important for your study
• inferential statistical procedures can be used for arriving at generalizations/conclusions about the population based on sample data

1. Simple Random
Each element of the population is given an equal chance of being included in the sample
Most basic probability sampling procedure
Foundation of all probability sampling procedures
When to use:
– The population is homogeneous
– A sampling frame is available (sampling frame – complete and updated list of the population)
Procedure:
– Lottery
– Use of random number generators

2. Systematic Random
Selecting every kth element of the population
When to use:
– When the population is homogeneous and there is no suspicion of a trend or pattern in the frame or geographical layout
– A sampling frame is available
Procedure:
i. Determine the sampling interval, k = N/n (rounded to the nearest integer)
ii. Identify the random start, rs: 1 ≤ rs ≤ k (randomly draw a value between 1 and k)
iii. Determine the numbers of the elements to be included in the sample: rs, rs + k, rs + 2k, …
Example:
N (population size) = 10,100
n (sample size) = 150
k = N/n = 10,100/150 = 67.3 ≈ 67
rs = a randomly chosen number between 1 and 67
Suppose rs = 43 => #43 in the sampling frame becomes the first to be included in the sample
second = rs + k = 43 + 67 = #110
third = rs + 2k = 43 + 2(67) or 110 + 67 = #177, etc.
Continue getting the numbers until the sample size of 150 is reached.

3. Stratified Random
Selecting random samples from mutually exclusive subpopulations, or strata, of the population
When to use:
– When the population is heterogeneous but can be subdivided into homogeneous subgroups or strata
– A sampling frame is available for each stratum
Procedure:
i. Determine the proportion of each stratum relative to the population
ii. Identify the stratum sample sizes using proportional allocation
iii. Select the samples from each stratum using either simple or systematic random sampling
Example: Among the 250 employees of the local office of an international insurance company, 182 are Filipinos, 51 are Chinese, and 17 are Americans. If we use proportional allocation to select a stratified random grievance committee of 15 employees, how many employees must we take from each race?
Solution:

Race (i)     Ni      %       ni
Filipino     182     72.8    15 x 72.8% = 10.92 ≈ 11
Chinese       51     20.4    15 x 20.4% = 3.06 ≈ 3
American      17      6.8    15 x 6.8% = 1.02 ≈ 1
Total        250    100      15

Therefore, 11 Filipinos, 3 Chinese, and 1 American will compose the grievance committee. These will have to be randomly selected from within each of the subgroups.

4. Cluster Random
Selecting clusters of elements rather than individual elements
When to use:
– when "natural" groupings are evident in a statistical population
– a sampling frame is not available
Procedure:
i. Divide the population into clusters (M = total number of clusters)
ii. Randomly select m clusters
iii. Include all elements within the selected clusters to form the resulting sample

5. Multi-stage random sampling
Repeated cluster sampling

B. Non-probability sampling
• not all elements of the population are given a chance of being included in the sample
• prone to selection bias
• however, there may be unique circumstances where non-probability sampling can also be justified (e.g., medical research), although the generalizability of the conclusion is not assured
• inferential statistical procedures cannot be used for arriving at generalizations/conclusions about the population based on sample data

1. Convenience / Voluntary / Haphazard / Accidental
Sample elements are selected because they are available
2. Judgmental / Purposive / Expert
The researcher selects the sample based on his or her judgment as to who best fits the established criteria
3. Quota
Selecting sample elements nonrandomly according to some fixed quota
4. Snowball
Especially useful when you are trying to reach populations that are inaccessible or hard to find

Problems in Sampling:

There are several potential sampling problems. When designing a study, a sampling procedure is also developed, including the potential sampling frame. Several problems may exist within the sampling frame.

1. Missing elements - individuals who should be on your list but for some reason are not. For example, if my population consists of all individuals living in a particular city and I use the phone directory as my sampling frame or list, I will miss individuals with unlisted numbers or who cannot afford a phone.
2. Foreign elements - elements which should not be included in my population and sample but appear on my sampling list. Thus, if I were to use property records to create my list of individuals living within a particular city, landlords who live elsewhere would be foreign elements. In this case, renters would be missing elements.
3. Duplicates - elements that appear more than once on the sampling frame. For example, if I am a researcher studying patient satisfaction with emergency room care, I may potentially include the same patient more than once in my study. If the patients are completing a patient satisfaction questionnaire, I need to make sure that patients are aware that if they have completed the questionnaire previously, they should not complete it again. If they complete it more than once, their second set of data represents a duplicate.

Read: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapter 8 – Sampling)

DATA COLLECTION PROCEDURES

1. Interview
There is interaction between interviewer and respondent
Most important method of data collection
Some advantages:
o Clarifications about ambiguous questions/answers can be made
o More in-depth information can be generated
Some disadvantages:
o Time-consuming
o Costly
o Responses may be influenced by the interviewer

2. Questionnaire
No interaction between facilitator and respondent about the subject matter
Respondent personally answers the questions on survey forms
Some advantages:
o Less costly
o Less time-consuming
o Responses are not influenced by the interviewer
o Respondents answer the questions with relative anonymity; may answer more truthfully
Some disadvantages:
o Not effective if the respondent is illiterate
o Clarifications about vague questions cannot be made
o Respondents may misinterpret the questions
o Intended respondents may not personally answer the forms; may request other people to respond
o Low rate of returns

3. Experimentation
A controlled study in which the researcher attempts to understand cause-and-effect relationships
The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives.

4. Observation
Like experiments, observational studies attempt to understand cause-and-effect relationships
Unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.
Also used for behavioral and attitudinal studies

Read: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks (Chapters 9 & 10)

ORGANIZATION AND PRESENTATION OF DATA

What is data organization?
- a systematic arrangement and summary of raw data so that it is easier to analyze and study

ORGANIZING AND SUMMARIZING QUALITATIVE DATA

Frequency Distribution - A tabular summary of data showing the number (frequency) of items in each of several non-overlapping classes.

Example: The following data were obtained from a sample of 50 soft drink purchases. Construct a frequency distribution to summarize the data.
Variable: soft drink purchased

Coke, Coke Zero, Pepsi Max, Pepsi, Pepsi,
Coke Zero, Coke Zero, Sprite, Coke, Coke,
Pepsi, Coke Zero, Pepsi Max, Coke Zero, Pepsi Max,
Pepsi Max, Sprite, Sprite, Coke Zero, Pepsi Max,
Pepsi Max, Coke, Coke, Pepsi, Coke,
Sprite, Coke, Coke, Mountain Dew, Mountain Dew,
Mountain Dew, Coke, Pepsi Max, Coke, Pepsi,
Mountain Dew, Pepsi, Pepsi Max, Mountain Dew, Pepsi Max,
Coke, Pepsi, Coke, Pepsi Max, Sprite,
Coke, Pepsi, Coke, Sprite, Mountain Dew

Salient points of a frequency distribution table:
a. appropriate label headings
b. categories of the variable being organized should be non-overlapping
   example: variable: soft drink brand
   categories: Coke, Coke Zero, Pepsi, Pepsi Max, Sprite, Mountain Dew
c. frequency – number of times the category appeared in the data set
d. percent – (frequency of the category ÷ total) x 100%

Table 1. Distribution of Soft Drink Purchases

Soft Drink Brand    Frequency    Percent
Coke                    14          28
Coke Zero                6          12
Pepsi                    8          16
Pepsi Max               10          20
Sprite                   6          12
Mountain Dew             6          12
Total                   50         100

Note: Follow the APA format in presenting data using a table.
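The tally behind Table 1 can be sketched in a few lines of Python. This is only an illustration, not part of the handout's Excel workflow: the names RAW_PURCHASES and frequency_table are made up for this sketch, and the 50 purchases are typed in as read from the example data.

```python
from collections import Counter

# The 50 soft drink purchases from the example (illustrative transcription).
RAW_PURCHASES = (
    "Coke, Coke Zero, Pepsi Max, Pepsi, Pepsi, "
    "Coke Zero, Coke Zero, Sprite, Coke, Coke, "
    "Pepsi, Coke Zero, Pepsi Max, Coke Zero, Pepsi Max, "
    "Pepsi Max, Sprite, Sprite, Coke Zero, Pepsi Max, "
    "Pepsi Max, Coke, Coke, Pepsi, Coke, "
    "Sprite, Coke, Coke, Mountain Dew, Mountain Dew, "
    "Mountain Dew, Coke, Pepsi Max, Coke, Pepsi, "
    "Mountain Dew, Pepsi, Pepsi Max, Mountain Dew, Pepsi Max, "
    "Coke, Pepsi, Coke, Pepsi Max, Sprite, "
    "Coke, Pepsi, Coke, Sprite, Mountain Dew"
)


def frequency_table(values):
    """Return (category, frequency, percent) rows in order of first appearance."""
    counts = Counter(values)              # frequency of each category
    total = sum(counts.values())          # total number of observations
    return [(cat, freq, round(100 * freq / total, 1))
            for cat, freq in counts.items()]


purchases = [p.strip() for p in RAW_PURCHASES.split(",")]
table = frequency_table(purchases)

print(f"{'Soft Drink Brand':<18}{'Frequency':>10}{'Percent':>9}")
for category, freq, pct in table:
    print(f"{category:<18}{freq:>10}{pct:>9.1f}")
print(f"{'Total':<18}{len(purchases):>10}{sum(p for _, _, p in table):>9.1f}")
```

The rows come out in order of first appearance in the data rather than in the order used in Table 1, but the frequencies and percents should agree with the table.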
Graphical presentations of qualitative data:

1. Bar graph – A graphical device for depicting qualitative data that have been summarized in a frequency or percent distribution

[Figure: bar graph. Vertical axis: No. of bottles bought (0 to 16). Horizontal axis: Soft drink brand (Coke, Coke Zero, Pepsi, Pepsi Max, Sprite, Mountain Dew).]
Fig. 1.1. Soft drink purchases of buyers

2. Pie chart – A graphical device for presenting data summaries based on subdivision of a circle into sectors that correspond to the percentage frequency for each category

[Figure: pie chart. Sectors: Coke 28%, Coke Zero 12%, Pepsi 16%, Pepsi Max 20%, Sprite 12%, Mountain Dew 12%.]
Fig. 1.2. Percentage distribution of soft drink purchases

USING EXCEL: Watch Excel Statistics 15: Category Frequency Distribution w Pivot Table & Pie Chart by ExcelIsFun at http://www.youtube.com/watch?v=-ERARVSfeuw

3. Rod Graph – a form of bar graph where the bars have zero width. It is especially used when the data are discrete

Example. Scores of 12 psychiatric patients on a 5-point anxiety scale:

Patient   1  2  3  4  5  6  7  8  9  10  11  12
Score     4  3  5  1  4  4  2  5  4   3   4   5

Array: 1, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5
Distinct score values: 1, 2, 3, 4, 5 (Ordinal data)

Table 1.3. Frequency distribution of anxiety scores

Score    Frequency    Percentage
1            1            8.3
2            1            8.3
3            2           16.7
4            5           41.7
5            3           25.0
Total       12          100.0

Rod Graph:
[Figure: rod graph. Vertical axis: Frequency (0 to 6). Horizontal axis: Score (0 to 5).]
Fig. 1.3. Anxiety scores of psychiatric patients

SHAPES OF DISTRIBUTIONS:

The rod graph (and later the histogram or frequency polygon) provides information about the shape of a distribution – how the collected data are distributed over the possible values of the variable. There are three major shapes: symmetric, positively skewed, and negatively skewed.

1. Symmetric – the shape of the left side of the distribution is a mirror image of the right side
2. Skewed – the two sides of the distribution are not mirror images of each other
a. Positively skewed (skewed to the right, right-skewed) – scores tend to cluster toward the lower end of the scale (i.e., the smaller numbers) with increasingly fewer scores at the upper end of the scale (the larger numbers)
b. Negatively skewed (skewed to the left, left-skewed) – most of the scores tend to occur toward the upper end of the scale while increasingly fewer scores occur toward the lower end

Note: the height of the graph represents the corresponding frequency of the point on the horizontal axis

Example:
[Figure: rod graph. Vertical axis: Frequency (0 to 6). Horizontal axis: Score (0 to 6).]
• negatively skewed
• longer left tail than right tail
• more scores to the right of the center (score = 3) than to the left

NOTE: more on shape will be discussed in relation to measures of central tendency (later topic)

READ: https://www.mathbootcamps.com/common-shapes-of-distributions/

ORGANIZING AND SUMMARIZING QUANTITATIVE DATA

> These procedures can be used for either continuous or discrete data

Frequency Distribution for Quantitative Data

Characteristics:
1. Non-overlapping class intervals (also called classes or intervals). Use between 5 and 20 classes. Use enough classes to show the variation in the data, but not so many that some contain only a few items.
2. Each class has a lower limit (the lowest possible value that can belong to it) and an upper limit (the highest possible value that can belong to it)
   Example: 11 – 15 => the class interval contains values from 11 to 15 (includes 11, 12, 13, 14, 15)
3. Uniform class width for all classes (also called interval size).
   • This may be identified by the difference between two successive lower limits or two successive upper limits
   • Can be generated by applications like Excel or statistical software

Example: The following data correspond to the age of the eldest child of parents in a given class:

12 14 19 18 16 30 15 15 18 17
21 31 20 27 22 23 15 25 22 21
33 28 14 22 14 18 16 13 27 18

Table 4. Frequency Distribution of Ages

Age (years)    Frequency
12 – 15            8
16 – 19            8
20 – 23            7
24 – 27            3
28 – 31            3
32 – 35            1
Total             30

Comments:
1. There are 6 class intervals.
2. The class width is 4 (difference between 2 successive lower limits, e.g., 16 - 12, 32 - 28; or difference between 2 successive upper limits, e.g., 31 - 27, 23 - 19)

> For purposes of presenting the data using a graph, additional columns are needed:
1. Class boundaries remove the gaps between intervals (there is a gap of 1 between 12 – 15 and 16 – 19, etc.) – this is especially necessary if your data are continuous
   => no more gap between the first and second intervals: 11.5 – 15.5 and 15.5 – 19.5, etc.
2. Class marks are the midpoints of the class intervals (add the lower limit and upper limit, then divide by 2)
   => Example: for the first interval: (12 + 15)/2 = 13.5 (do not round off)
3. Percentage = (frequency/total frequency) x 100%
   => first interval: (8/30) x 100% = 26.7

Example: Using the age data (Table 4), the table is expanded below:

Age (years)    Class Boundaries    Class Marks    Frequency    Percentage
12 – 15          11.5 – 15.5          13.5            8           26.7
16 – 19          15.5 – 19.5          17.5            8           26.7
20 – 23          19.5 – 23.5          21.5            7           23.3
24 – 27          23.5 – 27.5          25.5            3           10.0
28 – 31          27.5 – 31.5          29.5            3           10.0
32 – 35          31.5 – 35.5          33.5            1            3.3
Total                                                 30          100.0

Graphical Representations of Quantitative Frequency Distributions:

1. Histogram – A graph consisting of a series of vertical columns or rectangles with no gaps between bars
Each bar is drawn with a base equal to the class boundaries and a height corresponding to the class frequency
A suitable graph for representing data obtained from continuous variables.

[Figure: histogram. Vertical axis: Frequency (0 to 9). Horizontal axis: Age (years), class boundaries 11.5 – 15.5 through 31.5 – 35.5.]
Fig. 1.4 Age distribution of the eldest children

Comment: Consider the boundary line between the bars of the third and fourth intervals to be the middle value (dividing line between the left and right sides).
The shape of the distribution of ages is positively skewed.

2. Frequency Polygon – Constructed by plotting class marks (X) against class frequencies (Y) and connecting the consecutive points by straight lines
To close the frequency polygon, additional class marks (9.5 and 37.5) are added to both ends of the distribution, each with zero frequency

[Figure: frequency polygon. Vertical axis: Frequency (0 to 9). Horizontal axis: Age (years), class marks 9.5, 13.5, 17.5, 21.5, 25.5, 29.5, 33.5, 37.5.]

Comment: The shape is positively skewed – more values are concentrated to the left of the blue line than to the right.

USING EXCEL: Watch the following videos:

A. Excel Campus – Jon
1. Introduction to Pivot Tables, Charts, and Dashboards in Excel (Part 1)
   https://www.youtube.com/watch?v=9NUjHBNWe9M
2. Introduction to Pivot Tables, Charts, and Dashboards in Excel (Part 2)
   https://www.youtube.com/watch?v=g530cnFfk8Y
3. How to Create a Dashboard Using Pivot Tables and Charts in Excel (Part 3)
   https://www.youtube.com/watch?v=FyggutiBKvU

B. by DannyRocksExcels:
1. Two Ways to Create a Frequency Distribution Report in Excel, http://www.youtube.com/watch?v=nh5ObAKfj1o&feature=fvsr (preference: use of pivot functions)
2. Use an Excel Table to Group Data by Age Bracket, http://www.youtube.com/watch?v=GZvJniF6IPY

EXERCISES (using the Excel application)

1. Mari’s Steakhouse uses a questionnaire to ask customers how they rate the server, food quality, cocktails, prices, and atmosphere at the restaurant. Each characteristic is rated on a scale of outstanding (O), very good (V), good (G), average (A), and poor (P). Construct a frequency distribution, bar graph, and pie chart to summarize the following data collected on food quality. What is your feeling about the food quality ratings at the restaurant?

G O V G A O V O V G
O V A V O P V O G A
O O O G O V V A G O
V P V O O G O O V O
G A O V O O G V A G

2. The following are the final examination test scores of 50 statistics students.
68 45 38 52 54 43 69 44 52 64
55 56 50 54 38 40 54 55 51 55
65 59 37 57 46 29 64 58 53 37
42 56 42 49 49 43 41 55 49 47
64 42 53 63 33 60 63 41 48 50

a. Construct a frequency distribution using 7 classes.
b. Develop a histogram and a frequency polygon.
c. Determine the shape of the distribution.

3. The following data are the scores of 50 individuals who answered a 150-item aptitude test as a requirement for a job application.

112 107  97  69  72 100 115 106  73  73
 86  76  92 119  98 126 124 127 118 128
106  84  82  83 134 132 104  94  75  92
 92 100  96 108  85  98 115  81 102  91
 76  68 113  95 106  80  81 141  95 119

a. Construct a frequency distribution for this data set using 8 classes.
b. Construct a histogram and a frequency polygon.
c. Determine the shape of the distribution.
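For readers who want to check their Excel results, the class-interval procedure used for the age data (Table 4 and its expanded version) can be sketched in Python. This is a rough illustration under stated assumptions: the function name grouped_frequency_table and its parameters are invented here, the data are assumed to be whole numbers (so each class boundary sits 0.5 beyond its limit), and the first lower limit, class width, and number of classes are supplied by hand, as in the handout.

```python
# Ages of the eldest child, as listed in the handout's example.
AGES = [12, 14, 19, 18, 16, 30, 15, 15, 18, 17,
        21, 31, 20, 27, 22, 23, 15, 25, 22, 21,
        33, 28, 14, 22, 14, 18, 16, 13, 27, 18]


def grouped_frequency_table(values, first_lower, width, n_classes):
    """Build class limits, boundaries, marks, frequencies, and percents.

    Assumes integer-valued data, so class limits are inclusive and the
    boundaries lie 0.5 below the lower limit and 0.5 above the upper limit.
    """
    total = len(values)
    rows = []
    for i in range(n_classes):
        lower = first_lower + i * width        # lower class limit
        upper = lower + width - 1              # upper class limit
        freq = sum(lower <= v <= upper for v in values)
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "mark": (lower + upper) / 2,       # class mark (midpoint)
            "frequency": freq,
            "percent": round(100 * freq / total, 1),
        })
    return rows


rows = grouped_frequency_table(AGES, first_lower=12, width=4, n_classes=6)
for r in rows:
    lo, hi = r["limits"]
    b_lo, b_hi = r["boundaries"]
    print(f"{lo} - {hi}   {b_lo} - {b_hi}   {r['mark']}   "
          f"{r['frequency']}   {r['percent']}")
```

Applied to the exercises, the same function could be called with a different first lower limit, width, and number of classes; choosing those values is still left to the student.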

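As a supplementary sketch for the sampling section of this handout, the arithmetic of the systematic random sampling example (N = 10,100, n = 150, rs = 43) and of the proportional allocation example (182 Filipinos, 51 Chinese, 17 Americans, committee of 15) can be written in Python. The function names systematic_sample and proportional_allocation are illustrative assumptions, not taken from the handout, and the simple rounding used here is only one way to handle fractional results.

```python
import random


def systematic_sample(N, n, rs=None, seed=None):
    """Return (k, rs, selected frame numbers) for systematic random sampling."""
    k = round(N / n)                       # sampling interval, nearest integer
    if rs is None:                         # random start between 1 and k
        rs = random.Random(seed).randint(1, k)
    # Assumption: rs + (n - 1) * k does not exceed N. If k was rounded up,
    # the last numbers could overshoot the frame and would need handling.
    return k, rs, [rs + i * k for i in range(n)]


def proportional_allocation(strata_sizes, n):
    """Allocate n sample units to strata in proportion to stratum size."""
    N = sum(strata_sizes.values())
    # Plain rounding; in general the rounded values may not add up to n.
    return {name: round(n * size / N) for name, size in strata_sizes.items()}


# Systematic sampling example, fixing rs = 43 as in the handout.
k, rs, picks = systematic_sample(N=10_100, n=150, rs=43)
print(k, picks[:3], len(picks))

# Stratified sampling example with proportional allocation.
allocation = proportional_allocation(
    {"Filipino": 182, "Chinese": 51, "American": 17}, n=15
)
print(allocation)
```

Leaving rs unset draws the random start with Python's random module; passing a seed makes that draw repeatable.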