STT041 Chapters 1-2: Introduction to Statistics PDF
Summary
This document provides an introduction to statistics, covering topics such as data classification, populations and samples, parameters and statistics, and different types of statistical analysis. It includes examples and exercises to help students understand the concepts. Chapters 1 and 2 focus on essential principles of descriptive statistics and experimental design.
Chapter 1: Introduction to Statistics

Almost every day you are exposed to statistics. For instance, consider the next two statements.
o According to a survey, more than 7 in 10 Americans say a nursing career is a prestigious occupation. (Source: The Harris Poll)
o "Social media consumes kids today as well, as more score their first social media accounts at an average age of 11.4 years old." (Source: Influence Central's 2016 Digital Trends Study)
By learning the concepts in this text, you will gain the tools to become an informed consumer, understand statistical studies, conduct statistical research, and sharpen your critical thinking skills.

1.1 An Overview of Statistics

1.1.1 Data and Statistics
Data consists of information coming from observations, counts, measurements, or responses. Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

1.1.2 Population and Sample
A population is the collection of all outcomes, responses, measurements, or counts that are of interest. A sample is a subset of a population.
Example: In a recent survey, 250 college students at Union College were asked if they smoked cigarettes regularly. 35 of the students said yes. Identify the population and the sample.

1.1.3 Parameter and Statistic
A parameter is a numerical description of a population characteristic. A statistic is a numerical description of a sample characteristic.
Example: Decide whether the numerical value describes a population parameter or a sample statistic.
a.) A recent survey of a sample of 450 college students reported that the average weekly income for students is P1765.
b.) The average weekly income for all students is P1500.

1.1.4 Branches of Statistics
The study of statistics has two major branches: descriptive statistics and inferential statistics.
Descriptive statistics involves the organization, summarization, and display of data.
Inferential statistics involves using a sample to draw conclusions about a population.
Example: In a recent study, volunteers who had less than 6 hours of sleep were four times more likely to answer incorrectly on a science test than were participants who had at least 8 hours of sleep. Decide which part is the descriptive statistic and what conclusion might be drawn using inferential statistics.
Answer: The statement "four times more likely to answer incorrectly" is a descriptive statistic. An inference drawn from the sample is that all individuals who sleep less than 6 hours are more likely to answer science questions incorrectly than individuals who sleep at least 8 hours.

1.2 Data Classification

1.2.1 Types of Data
Qualitative data consists of attributes, labels, or nonnumerical entries. Quantitative data consists of numerical measurements or counts.
Example: The grade point averages of five students are listed in the table. Which data are qualitative data and which are quantitative data?
Student   GPA
Sally     3.22
Bob       3.98
Cindy     2.75
Mark      2.24
Kathy     3.84

1.2.1.1 Levels of Measurement
The level of measurement determines which statistical calculations are meaningful. The four levels of measurement are: nominal, ordinal, interval, and ratio.
Data at the nominal level of measurement are categorized using names, labels, or qualities. No mathematical computations can be made at this level. The data cannot be arranged in an ordering scheme, because there is no criterion by which values can be identified as greater than or less than other values. Examples are colors in the flag, names of students in STT101 class, and textbooks used in the current semester.
Data at the ordinal level of measurement can be arranged in order, but differences between data entries are not meaningful. Examples are class standings (freshman, sophomore, junior, senior) and customer satisfaction surveys (Very dissatisfied, Somewhat dissatisfied, Neutral, Somewhat satisfied, Very satisfied).
Data at the interval level of measurement can be arranged in order, and the differences between data entries can be calculated. This is the same as the ordinal level, with the additional property that we can determine meaningful amounts of difference between the data. Data at this level may lack an inherent zero starting point. Examples are temperatures in Celsius and Fahrenheit, and years on a timeline.
The ratio level of measurement is the interval level modified to include an inherent zero starting point. Both the differences and the ratios of data are meaningful. This is also the highest level of measurement. Examples are age, time, weight, and length.

1.3 Experimental Design

1.3.1 Methods of Data Collection
In an observational study, a researcher observes and measures characteristics of interest of part of a population.
In an experiment, a treatment is applied to part of a population, and responses are observed.
A simulation is the use of a mathematical or physical model to reproduce the conditions of a situation or process.
A survey is an investigation of one or more characteristics of a population. A census is a measurement of an entire population. A sampling is a measurement of part of a population.

1.3.1.1 Sampling Techniques
Samples can be broken down into two basic types: nonprobability and probability. In the nonprobability type, there is no way of estimating the probability that each individual or element will be included in the sample. In probability sampling, in the most frequently encountered situations, each individual has an equal chance of becoming a part of the sample.

1. Nonprobability Sampling
The students in a class may constitute the entire sample because they happen to be in a class whose instructor is interested in doing some research. Such samples are called accidental or incidental samples. Another type of nonprobability sampling is quota sampling.
In this type of sampling, the proportions of the various subgroups in the population are determined and the sample is drawn (usually not at random) to have the same percentages in it. The third type is purposive sampling. For example, the cities in the Philippines that have voted for the winners in a series of past presidential elections could be identified. We study these cities, and from the voters' preferences we make a prediction on the outcome of a national election. The major advantage of samples like these is that they are convenient and economical.

2. Probability Sampling
Simple random sampling is the basic type of probability sampling. In this type, each individual in the population has an equal chance of being drawn into the sample. This could be done by drawing lots or by the use of random numbers. When sampling procedures are not carried out like this, the result is said to be biased.
Systematic sampling selects every kth element in the population for the sample, with the starting point determined at random from the first k elements. Systematic samples are very easy to obtain and are often used as if they were random samples. In fact, some systematic samples can lead to precise inferences concerning population parameters simply because the sample values spread evenly over the entire population. However, a real danger exists if one happens to choose a sampling interval that corresponds to a hidden periodicity.
Cluster sampling selects samples containing either all, or a random selection, of the elements from clusters that have themselves been selected randomly from the population. It has the advantage of being more cost efficient when the population is widely scattered. When the clusters are geographic areas, such as regions of a state or subdivisions of a large city, the sampling procedure is called area sampling.
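The simple random and systematic procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is a made-up list of 100 numbered individuals, and the function names simple_random_sample and systematic_sample are hypothetical, not taken from these notes.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n individuals so that each has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement


def systematic_sample(population, k, seed=None):
    """Take every kth element, starting at a random point among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random index in 0 .. k-1
    return population[start::k]


# Hypothetical population: individuals numbered 1 to 100.
population = list(range(1, 101))

srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)

print("Simple random sample:", sorted(srs))
print("Systematic sample:   ", sys_sample)
```

Fixing the seed only makes the illustration repeatable. Note how the systematic sample is evenly spread over the list, which is exactly why a hidden periodicity of length k in the population would be a problem.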
Stratified random sampling selects simple random samples from mutually exclusive subpopulations, or strata, of the population. Here, the population is divided into strata such that the data of interest are fairly homogeneous within a given stratum. Stratification of a population results in strata of various sizes. Consideration must therefore be given to the size of the random samples selected from these strata. This could be done using proportional allocation, which chooses sample sizes proportional to the sizes of the different strata.

1.4 Exercises

1-4 Identify the population and the sample. Describe the sample data set.
1. A survey of 4787 adults found that 15% use ride-hailing applications.
2. Forty-two professors in Pennsylvania were surveyed concerning their opinions of the current education policy of the state.
3. A survey of 2223 adults found that 62% ... pursue a career as a video game developer or designer.
4. A survey of 1601 children and adults ages 16 years and older found that 48% ... 12 months.

5-8 Determine whether the number describes a population parameter or a sample statistic.
5. In 2016, the National Science Foundation announced 22.7 million in infrastructure-strengthening investments.
6. In a survey of 1000 likely voters, 29 percent trust media fact-checking of candidates' comments.
7. In a recent study of physics majors at a university, 12 students were minoring in math.
8. Thirty percent of a sample of 521 workers say that they worry about having their benefits reduced.

9-12 Determine whether the data are qualitative or quantitative.
9. The ages of a sample of 430 employees of a software company.
10. The IQ levels of the students of a secondary school.
11. The revenues of the companies on the Fortune 500 list.
12. The genders of a sample of 1,000 students of a university.

13-15 Identify the sampling technique used in each study.
13.
A journalist asks people at a campground about air pollution.
14. For quality assurance, every tenth machine part is selected from an assembly line and measured for accuracy.
15. A study on attitudes about smoking is conducted at a college. The students are divided by class (freshman, sophomore, junior, and senior). Then a random sample is selected from each class and interviewed.

Chapter 2: Descriptive Statistics

2.1 Frequency Distributions
A frequency distribution is a table that shows classes or intervals of data with a count of the number of entries in each class. The frequency f of a class is the number of data points in the class.

Constructing a Frequency Distribution
Guidelines:
1. Decide on the number of classes to include. The number of classes should be between 5 and 20; otherwise, it may be difficult to detect any patterns.
2. Find the class width as follows. Determine the range of the data, divide the range by the number of classes, and round up to the next convenient number.
3. Find the class limits. You can use the minimum entry as the lower limit of the first class. To find the remaining lower limits, add the class width to the lower limit of the preceding class. Then find the upper class limits.
4. Make a tally mark for each data entry in the row of the appropriate class.
5. Count the tally marks to find the total frequency f for each class.

Constructing a Frequency Distribution from a Data Set
Example 2.1.1. The data set lists the out-of-pocket prescription medicine expenses (in dollars) for 30 U.S. adults in a recent year. Construct a frequency distribution that has seven classes. (Adapted from: Health, United States, 2015)
200 239 155 252 384 165 296 405 303 400
307 247 256 315 330 317 352 266 276 345
238 306 290 271 345 312 293 195 168 342
Solution:
1. The number of classes (7) is stated in the problem.
2. The minimum data entry is 155 and the maximum entry is 405, so the range is 405 - 155 = 250.
Divide the range by the number of classes to find the class width.
Class width = range / (number of classes) = 250 / 7 ≈ 35.71
Rounding up to the next convenient number gives a class width of 36.
3. The minimum data entry is a convenient lower limit for the first class. To find the lower limits of the remaining six classes, add the class width of 36 to the lower limit of each previous class. So, the lower limits of the other classes are 155 + 36 = 191, 191 + 36 = 227, and so on. The upper limit of the first class is 190, which is one less than the lower limit of the second class. The upper limits of the other classes are 190 + 36 = 226, 226 + 36 = 262, and so on. The seven classes are therefore 155-190, 191-226, 227-262, 263-298, 299-334, 335-370, and 371-406.
4. Make a tally mark for each data entry in the appropriate class. For instance, the data entry 168 is in the 155-190 class, so make a tally mark in that class. Continue until you have made a tally mark for each of the 30 data entries.
5. The number of tally marks for a class is the frequency of that class.
The frequency distribution is shown below. The first class, 155-190, has three tally marks. So, the frequency of this class is 3. Notice that the sum of the frequencies is 30, which is the number of entries in the data set. The sum is denoted by Σf, where Σ is the uppercase Greek letter sigma.

Frequency Distribution for Out-of-Pocket Prescription Medicine Expenses (in dollars)
Class (expenses)   Tally       Frequency f (number of adults)
155-190            |||         3
191-226            ||          2
227-262            |||||       5
263-298            ||||| |     6
299-334            ||||| ||    7
335-370            ||||        4
371-406            |||         3
                               Σf = 30
Check that the sum of the frequencies equals the number of entries in the sample.

Midpoint
The midpoint of a class is the sum of the lower and upper limits of the class divided by two. The midpoint is sometimes called the class mark.
Midpoint = [(lower class limit) + (upper class limit)] / 2

Relative Frequency
The relative frequency of a class is the portion or percentage of the data that falls in that class.
To find the relative frequency of a class, divide the frequency f by the sample size n.
Relative frequency = (class frequency) / (sample size) = f / n
Note: n = Σf
The cumulative frequency of a class is the sum of the frequencies of that class and all previous classes.

Example 2.1.2. Finding Midpoints, Relative Frequencies and Cumulative Frequencies
Class     f   Midpoint                 Relative frequency   Cumulative frequency
155-190   3   (155 + 190)/2 = 172.5    3/30 = 0.1           3
191-226   2   (191 + 226)/2 = 208.5    2/30 ≈ 0.07          3 + 2 = 5
227-262   5   (227 + 262)/2 = 244.5    5/30 ≈ 0.17          5 + 5 = 10
263-298   6   (263 + 298)/2 = 280.5    6/30 = 0.2           10 + 6 = 16
299-334   7   (299 + 334)/2 = 316.5    7/30 ≈ 0.23          16 + 7 = 23
The remaining midpoints, relative frequencies, and cumulative frequencies are shown in the expanded frequency distribution below.

Frequency Distribution for Out-of-Pocket Prescription Medicine Expenses (in dollars)
Class (expenses)   Frequency f (number of adults)   Midpoint   Relative frequency   Cumulative frequency
155-190            3                                172.5      0.1                  3
191-226            2                                208.5      0.07                 5
227-262            5                                244.5      0.17                 10
263-298            6                                280.5      0.2                  16
299-334            7                                316.5      0.23                 23
335-370            4                                352.5      0.13                 27
371-406            3                                388.5      0.1                  30
                   Σf = 30
Interpretation: There are several patterns in the data set. For instance, the most common range for the expenses is $299 to $334. Also, about half of the expenses are less than $299.

2.2 Graphs of Frequency Distributions

Frequency Histogram
A frequency histogram uses bars to represent the frequency distribution of a data set. A histogram has the following properties.
1. The horizontal scale is quantitative and measures the data entries.
2. The vertical scale measures the frequencies of the classes.
3. Consecutive bars must touch.
Because consecutive bars of a histogram must touch, bars must begin and end at class boundaries instead of class limits. Class boundaries are the numbers that separate classes without forming gaps between them. For data that are integers, subtract 0.5 from each lower limit to find the lower class boundaries. To find the upper class boundaries, add 0.5 to each upper limit. The upper boundary of a class will equal the lower boundary of the next higher class.
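The boundary rule above, together with the midpoint formula, can be checked with a short Python sketch. It assumes the seven classes of the prescription-expense example (first lower limit 155, class width 36); the variable names are illustrative only.

```python
# Class limits for the expense example: lower limits start at 155
# and increase by the class width of 36.
class_width = 36
lower_limits = [155 + class_width * i for i in range(7)]
upper_limits = [lo + class_width - 1 for lo in lower_limits]

# For integer data, boundaries sit half a unit outside the limits.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in zip(lower_limits, upper_limits)]

# Midpoint (class mark) = (lower limit + upper limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

for (lo, hi), (lb, ub), mid in zip(zip(lower_limits, upper_limits), boundaries, midpoints):
    print(f"{lo}-{hi}: boundaries {lb} to {ub}, midpoint {mid}")
```

Each upper boundary coincides with the next lower boundary, which is what lets the bars of a histogram touch.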
To construct a frequency histogram, first find the class boundaries. Because the data entries are integers, subtract 0.5 from each lower limit to find the lower class boundaries and add 0.5 to each upper limit to find the upper class boundaries. So, the lower and upper boundaries of the first class are as follows.
First class lower boundary: 155 - 0.5 = 154.5
First class upper boundary: 190 + 0.5 = 190.5
The boundaries of the remaining classes are found in the same way; for example, the second class runs from 190.5 to 226.5. To construct the histogram, choose possible frequency values for the vertical scale. You can mark the horizontal scale either at the midpoints or at the class boundaries. Both histograms are shown below.
Example 2.2.1.
[Figure: two frequency histograms titled "Out-of-Pocket Prescription Medicine Expenses", one labeled with class midpoints and one labeled with class boundaries; horizontal axis: Expenses (in dollars).]

Frequency Polygon
A frequency polygon is a line graph that emphasizes the continuous change in frequencies.
To construct the frequency polygon, use the same horizontal and vertical scales that were used in the histogram labeled with class midpoints in Example 2.2.1. Then plot points that represent the midpoint and frequency of each class and connect the points in order from left to right with line segments. Because the graph should begin and end on the horizontal axis, extend the left side to one class width before the first class midpoint and extend the right side to one class width after the last class midpoint.
[Figure: frequency polygon titled "Out-of-Pocket Prescription Medicine Expenses"; horizontal axis: Expenses (in dollars), marked at the class midpoints.]
Interpretation: You can see that the frequency of adults increases up to an expense of $316.50 and then the frequency decreases.

Relative Frequency Histogram
A relative frequency histogram has the same shape and the same horizontal scale as the corresponding frequency histogram. The difference is that the vertical scale measures the relative frequencies, not frequencies.
The relative frequency histogram is shown. Notice that the shape of the histogram is the same as the shape of the frequency histogram constructed in Example 2.2.1. The only difference is that the vertical scale measures the relative frequencies.
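The bar heights of both histograms come from the same frequency table, so it may help to see that table built directly from the data. The Python sketch below follows the five guideline steps of Section 2.1. It is a minimal illustration, not a definitive implementation: the data list is the one from Example 2.1.1, with the maximum taken as 405 so that it agrees with the stated class limits (that reading is an assumption), and all variable names are made up for this sketch.

```python
import math
from itertools import accumulate

# Out-of-pocket prescription medicine expenses (in dollars), 30 adults.
# Assumption: the maximum entry is read as 405 to match the class 371-406.
data = [200, 239, 155, 252, 384, 165, 296, 405, 303, 400,
        307, 247, 256, 315, 330, 317, 352, 266, 276, 345,
        238, 306, 290, 271, 345, 312, 293, 195, 168, 342]

num_classes = 7                                   # step 1
data_range = max(data) - min(data)                # step 2: 405 - 155
# Rounding up with ceil works here because 250 / 7 is not a whole number.
class_width = math.ceil(data_range / num_classes)

# Step 3: class limits, starting at the minimum entry.
lower = [min(data) + i * class_width for i in range(num_classes)]
upper = [lo + class_width - 1 for lo in lower]

# Steps 4-5: count the entries that fall inside each class.
freqs = [sum(lo <= x <= hi for x in data) for lo, hi in zip(lower, upper)]

n = sum(freqs)                                    # n = sum of f
rel_freqs = [f / n for f in freqs]                # heights for the relative frequency histogram
cum_freqs = list(accumulate(freqs))               # running totals

for lo, hi, f, rf, cf in zip(lower, upper, freqs, rel_freqs, cum_freqs):
    print(f"{lo}-{hi}: f = {f}, relative = {rf:.2f}, cumulative = {cf}")
```

Because every relative frequency is just f divided by the same n, the relative frequency histogram is the frequency histogram with its vertical scale relabeled, which is why the two shapes match.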