Introduction to Mathematical Statistics PDF
Uploaded by IntimateArtInformel8466
This document introduces the fundamental concepts of mathematical statistics. It covers the process of collecting, organizing, summarizing, and analyzing data to answer research questions.
INTRODUCTION TO MATHEMATICAL STATISTICS
KARYL PERLAS | BAPOS 2B

INTRODUCTION TO STATISTICS

Statistics = the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions.
- The information referred to in the definition is the data. According to the Merriam-Webster dictionary, data are "factual information used as a basis for reasoning, discussion, or calculation".

Understand the Process of Statistics

1. Identify the research objective
A researcher must determine the question(s) he or she wants answered. The question(s) must be detailed enough to identify the group to be studied and the questions to be answered. The group to be studied is called the population.
Universe = set of all entities under study
Population = set of all possible values of the variable
Individual = a person or object that is a member of the population being studied

2. Collect the information needed to answer the questions
Everybody collects and uses information, much of it in numerical or statistical form, in day-to-day life. Gaining access to an entire population is often difficult and expensive. In conducting research, we typically look at a subset of the population called a sample.
Sample = a subset of the universe or the population

3. Organize and summarize the information
This step in the process is referred to as descriptive statistics. Descriptive statistics describe the information collected through numerical measurements, charts, graphs, and tables. The main purpose of descriptive statistics is to provide an overview of the information collected.

4. Draw conclusions from the information
In this step the information collected from the sample is generalized to the population. This process is referred to as inferential statistics. Inferential statistics uses methods that take results obtained from a sample, extend them to the population, and measure the reliability of the result.

If the entire population is studied, then inferential statistics is not necessary, because descriptive statistics will provide all the information that we need regarding the population.

Variables = the characteristics that differentiate every individual within the population/sample.

Classification of Variables
1. Qualitative variable = a variable that yields categorical responses. Its value is a word or a code that represents a class or category.
2. Quantitative variable = a variable that takes on numerical values representing an amount or quantity.
Quantitative variables may be further classified into:
Discrete variable = a quantitative variable that has either a finite number of possible values or a countable number of possible values. The term countable means that the values result from counting, such as 0, 1, 2, 3, and so on.
Continuous variable = a quantitative variable that has an infinite number of possible values that are not countable.

Levels of Measurement
Nominal Level = identifies, names, classifies, or categorizes objects or events.
Example: Method of payment (cash, check, debit card, credit card), Type of school (public vs. private), Eye color (blue, green, brown)
Ordinal Level = like nominal scales, identifies, names, classifies, or categorizes objects or events, but has the additional property of a logical or natural order to the categories or values.
Example: Food preferences, Rank of a military officer, Socioeconomic class (First, Middle, Lower)
Interval Level = identifies, has ordered values, and has the additional property of equal distances or intervals between scale values.
Example: Temperature on a Fahrenheit/Celsius thermometer, Trait anxiety (e.g., high anxious vs. low anxious), IQ (e.g., high IQ vs. average IQ vs. low IQ)
Ratio Level = identifies, orders, represents equal distances between score values, and has an absolute zero point.
Example: Height, Weight, Number of words correctly spelled

Data collection = the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Consequences from Improperly Collected Data
- Inability to answer research questions accurately
- Inability to repeat and validate the study
- Distorted findings resulting in wasted resources
- Misleading other researchers to pursue fruitless avenues of investigation
- Compromising decisions for public policy
- Causing harm to human participants and animal subjects

Steps in Data Gathering
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design the data gathering forms to be used.
5. Collect data.

Sources of Data

Primary Data = information collected and processed directly by the researchers.
Primary data can be collected by the following methods:
1. Direct personal interviews. The researcher has direct contact with the interviewee and gathers information by asking the interviewee questions.
2. Indirect/Questionnaire Method. This method of data collection involves sourcing and accessing existing data that were originally collected for the purpose of the study.
3. Focus group = a group interview of approximately six to twelve people who share similar characteristics or common interests. A facilitator guides the group based on a predetermined set of topics.
4. Experiment = a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.
   Bear in mind that the experimental method has several limitations that you should be aware of:
   - Ethical, moral, and legal concerns
   - Unrealistic controlled environments
   - Inability to control for all variables
5. Observation = a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.

Key Design Principles of a Good Questionnaire
1. Keep the questionnaire as short as possible.
2. Decide on the type of questionnaire (Open Ended or Closed Ended).
3. Write the questions properly.
4. Avoid questions that prompt or motivate the respondent to say what you would like to hear.
5. Order the questions appropriately.
Reminders: The most important step in writing a questionnaire is to decide what you want to find out.

Example: Two surveys were taken in late 1993/early 1994 about Elvis Presley.
One survey asked:
"In the past few years, there have been a lot of rumors and stories about whether Elvis Presley is really dead. How do you feel about this? Do you think there is any possibility that these rumors are true and that Elvis Presley is still alive, or don't you think so?"
The second survey asked:
"A recent television show examined various theories about Elvis Presley's death. Do you think it is possible that Elvis is alive or not?"
8% of the respondents to the first question said it is possible that Elvis is still alive. 16% of the respondents to the second question said it is possible that Elvis is still alive.

Open-Ended vs. Closed-Ended Questionnaire

Secondary Data = information that has already been collected, processed, and reported by another researcher/entity.
Methods of Collecting Secondary Data
- Published reports in newspapers and periodicals
- Financial data reported in annual reports
- Records maintained by the institution
- Internal reports of government departments
- Information from official publications
Reminders:
- Always investigate the validity and reliability of the data by examining the collection method employed by your source.
- Do not use inappropriate data for your research.

FREQUENCY DISTRIBUTIONS AND GRAPHS

Organizing Data
Raw data = data collected in original form.
When the raw data are organized into a frequency distribution, the frequency is the number of values in a specific class of the distribution.

Frequency Distribution = the organization of raw data in table form, using classes and frequencies.

Three Types of Frequency Distributions

Categorical Frequency Distributions = used for data that can be placed in specific categories, such as nominal or ordinal level data.
Examples: political affiliation, religious affiliation, blood type, etc.
Example: Blood Type Frequency Distribution

Ungrouped Frequency Distributions = used for data that can be enumerated and when the range of values in the data set is not large.
Examples: number of miles your instructors have to travel from home to campus, number of girls in a 4-child family, etc.
Example: Number of Miles Traveled

Grouped Frequency Distributions = used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.
Examples: the life of boat batteries in hours
Example: Lifetimes of Boat Batteries

Histograms, Frequency Polygons, and Ogives
Histogram = a graph that displays the data by using vertical bars of various heights to represent the frequencies.
Frequency Polygon = a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.

Terms Associated with a Grouped Frequency Distribution
Class limits = the smallest and largest data values that can be included in a class. In the lifetimes of boat batteries example, the values 24 and 30 of the first class are the class limits. The lower class limit is 24 and the upper class limit is 30.
Class boundaries = used to separate the classes so that there are no gaps in the frequency distribution.
Class width = difference between the upper and lower class boundaries of a class (equivalently, between the lower limits of two consecutive classes).

Guidelines for Constructing a Frequency Distribution
- There should be between 5 and 20 classes.
- The class width should be an odd number.
- The classes must be mutually exclusive.
- The classes must be continuous.
- The classes must be exhaustive.
- The classes must be equal in width.

Procedure for Constructing a Grouped Frequency Distribution
1. Find the highest value and the lowest value.
2. Find the range R (highest value minus lowest value).
3. Select the number of classes k desired (this determines the number of classes listed in the frequency distribution table).
4. Find the width by dividing the range by the number of classes and rounding up (the width should be an odd number; it determines how many values each class covers).
5. Select a starting point (usually the lowest value) and add the width repeatedly to get the lower limits. With a starting point of 5 and a width of 3: 5 + 3 = 8, 8 + 3 = 11, 11 + 3 = 14, 14 + 3 = 17, 17 + 3 = 20. Hence the lower limits are 5, 8, 11, 14, 17, 20.
6. Find the upper class limits: subtract 1 from the second lower limit to get the first upper limit, then keep adding the width. With the same values: 8 - 1 = 7, then 7 + 3 = 10, 10 + 3 = 13, 13 + 3 = 16, 16 + 3 = 19, 19 + 3 = 22. Hence the upper limits are 7, 10, 13, 16, 19, 22.
7. Find the class boundaries: lower limit (LL) - 0.5 and upper limit (UL) + 0.5.
8. Tally the data and write the counts under frequency. Then find the cumulative frequency (the frequency of the first class plus each following frequency in turn, up to the last class).
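The eight-step procedure above can be sketched in Python. This is a minimal illustration rather than the notes' own method: the function names are made up for the sketch, it assumes whole-number data, and the 20 data values are hypothetical (chosen only so that the lowest value is 5 and the highest is 22, matching the smoking-survey example in these notes, whose actual data table is not available here).

```python
import math


def class_limits(lowest, highest, k):
    """Steps 2 to 6: return the class width and k (lower, upper) limit pairs."""
    value_range = highest - lowest              # step 2: range R
    width = math.ceil(value_range / k)          # step 4: R / k, rounded up
    # Assumption (not in the notes): if R / k divides exactly, the last class
    # would stop short of the highest value, so widen every class by one.
    if lowest + k * width - 1 < highest:
        width += 1
    limits = []
    lower = lowest                              # step 5: start at the lowest value
    for _ in range(k):
        upper = lower + width - 1               # step 6: next lower limit minus 1
        limits.append((lower, upper))
        lower += width
    return width, limits


def frequency_table(data, k):
    """Steps 1 to 8 for whole-number data: one dict per class."""
    width, limits = class_limits(min(data), max(data), k)   # step 1: extremes
    rows, cumulative = [], 0
    for lower, upper in limits:
        frequency = sum(lower <= value <= upper for value in data)  # step 8: tally
        cumulative += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),       # step 7
            "f": frequency,
            "cf": cumulative,
        })
    return width, rows


# Hypothetical data: 20 made-up values with lowest 5 and highest 22.
hypothetical_data = [5, 7, 8, 9, 10, 11, 11, 12, 13, 13,
                     14, 14, 15, 16, 17, 18, 19, 20, 21, 22]

width, rows = frequency_table(hypothetical_data, 6)
print("class width:", width)
for row in rows:
    print(row["limits"], row["boundaries"], row["f"], row["cf"])
```

With 6 classes this gives a width of 3 and the class limits 5-7, 8-10, 11-13, 14-16, 17-19, 20-22, the same limits the procedure produces by hand. The sketch does not enforce the guideline that the width be odd; that check is left to the reader.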
Example
In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using 6 classes.
Steps:
1. Highest value = 22; lowest value = 5
2. Range R = highest value - lowest value = 22 - 5 = 17
3. Number of classes k = 6
4. Class width c = R / k = 17/6 = 2.83, rounded up to c = 3
5. Starting point = lowest value = 5. Lower limits: add the width to the starting point repeatedly.

Cumulative Frequency Graph or Ogive = a graph that represents the cumulative frequencies for the classes in a frequency distribution.

Other Types of Graphs
Pareto chart = used to represent a frequency distribution for a categorical variable.
Time Series Graph = represents data that occur over a specific period of time.
Pie Graph = a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

MATHEMATICS AS A TOOL: DATA MANAGEMENT & MEASURES OF CENTRAL TENDENCY

Population
= defined as groups of people, animals, places, things, or ideas to which any conclusions based on characteristics of a sample will be applied.
= The population is a complete set.
= Reports are a true representation of opinion.
= It contains all members of a specified group.
= The measurable quality is called a parameter: a numerical measure that describes a characteristic of the population.
E.g.
- In 2010, the population of the town numbered about 5,000.
- This city has a population of more than 1,000,000.
- There are 305 doctors in this hospital.
- The number of students enrolled at the college is 15,000.

Sample
= a subgroup of the population.
= The sample is a subset of the population.
= Reports have a margin of error and confidence interval.
= It is a subset that represents the entire population.
= The measurable quality is called a statistic: a numerical measure that is used to describe a characteristic of a sample.
E.g.
- According to a survey of 606 city residents, garbage collection was the city service people liked most.
- A survey of 2,000 federation members showed that 48% believed police should have the right to take industrial action.

Because a parameter can be found only when you have data about everyone in the population, it is fixed. In contrast, a statistic that describes the same population can vary, because you can take different samples from the same population and thus get different results. So, a parameter is more reliable than a statistic. Still, when a population is so large that nobody has the time or the resources to ask everyone, a statistic will provide enough information to draw conclusions.

Measures of Central Tendency
= The purpose of a measure of central tendency (measure of centre or central location) is to describe a whole set of data with a single value that represents the middle or centre of its distribution.
= most representative or typical of all values in a group; the "average"
= It is a way to describe the center of a data set.
= It lets us know what is normal or "average" for a set of data.
= It also condenses the data set down to one representative value, which is useful when you are working with large amounts of data.

Mean
= average or norm
= most stable measure
= affected by extreme values
= may not exist as a data point in the set
= defined as the sum of all the values in the observation or dataset divided by the total number of observations. This is also known as the arithmetic average.
= can be used for both continuous and discrete numeric data, but not for categorical data, as the values cannot be summed.
= As the mean includes every value in the distribution, it is influenced by outliers (numbers that are much higher or much lower than the rest of the data set) and by skewed (asymmetric) distributions.
= This measurement is applicable to ratio and interval data.

Ungrouped data
Mean = sum of observations / total number of observations
$\bar{x} = \dfrac{\sum x}{N}$
Where:
$\bar{x}$ = mean
$x$ = data
$N$ = total number of data

Grouped data
❖ Long/Midpoint Method
$\bar{x} = \dfrac{\sum fx}{N}$
Where:
$\bar{x}$ = mean
$x$ = data / middle point = (Ll + Ul)/2, or add the interval after getting the first middle point
$N$ = total number of data / sample size (also written $n$)
$f$ = frequency
$fx$ = (f)(x)
Step 1: Get the middle point of each class.
Step 2: Multiply the frequency and the data (middle point).
Step 3: Get the summation of frequency ($\sum f$) and the summation of the product of frequency and data ($\sum fx$).

❖ Coded Deviation Method
$\bar{x} = AM + \left[\dfrac{\sum fd}{n}\right] i$
Where:
$\bar{x}$ = mean
$AM$ = assumed mean = (ll + ul)/2, the midpoint of one of the classes
$d$ = deviation: positive for higher classes, negative for lower classes; if the classes are listed in descending order, d is descending as well
$f$ = frequency, $n$ = total number of data, $i$ = class interval

Median
= value that divides ranked data points into halves: 50% larger than it, 50% smaller
= may not exist as a data point in the set
= influenced by the position of items, but not their values
= considered the physical middle point of a distribution because it is located at the center position when the values are arranged in ascending or descending order, which in turn divides the distribution in half (50% of the observations lie on either side of the median value).
= If a distribution has an odd number of observations, the median is the middle value. If it has an even number, the median is the mean or average of the two middle values.
E.g. 2, 2, 3, 5, 5, 7, 8: the median is 5.

Ungrouped data = Arrange the data in ascending or descending order, then find the value in the center.

Grouped data
$\tilde{x} = lb_m + \left(\dfrac{\frac{n}{2} - {<}cf}{f_m}\right) i$
Where:
$lb_m$ = lower boundary of the median class = lower limit (ll) of the median class - 0.5
${<}cf$ = less-than cumulative frequency before the median class
$f_m$ = frequency of the median class
$i$ = class interval
$n$ = total number of data
Midpoint = n/2; then look in the cf column for the lowest value that contains the midpoint. That class is the median class.

Mode
= most frequent data point
= exists as a data point
= unaffected by extreme values
= useful for qualitative data
= may have more than 1 value
= can be found for both numerical and categorical (non-numerical) data. It is the most commonly occurring value in a distribution.
= There can be more than one mode for the same distribution of data (bimodal or multimodal), which limits the ability of the mode to describe the center of the distribution.
= In some particular cases, the distribution may have no mode at all (i.e. if all values are different). In such a case, it may be better to consider using the median or mean, or to group the data into appropriate intervals and find the modal class.
E.g. Count how many times each value appears. The mode is the value that appears the most. You can have more than one mode.
2, 2, 3, 5, 5, 7, 8: the modes are 2 and 5.

Ungrouped data = Simply find the most recurring value.

Grouped data
$\hat{x} = L_{mo} + \left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
Where:
$L_{mo}$ = lower boundary of the modal class (the class with the highest frequency)
$\Delta_2$ = highest frequency minus the frequency after the modal class

GET THE MEAN, MEDIAN, MODE

Classes    f
30-34      3
35-39     27
40-44     18
45-49     25
50-54     12
55-59     30
60-64      8
65-69      7
i = 5, n = 130

❖ MEAN

Classes    f     x     fx
30-34      3    32     96
35-39     27    37    999
40-44     18    42    756
45-49     25    47   1175
50-54     12    52    624
55-59     30    57   1710
60-64      8    62    496
65-69      7    67    469
i = 5, n = 130, $\sum fx$ = 6,325

$\bar{x} = \dfrac{\sum fx}{N} = \dfrac{6{,}325}{130} = 48.65$

Classes    f     x