EDA 1st Midterm Reviewer PDF
Document Details
Uploaded by LovingGreen
Summary
This document is a reviewer for a midterm exam in Engineering Data Analysis. It touches on inferential statistics, including sampling methods, probability, and descriptive statistics with emphasis on organizing and displaying data through tables and graphs. Important terminologies such as population and sample are also included. Types of data, probability and data calculations are also part of the review.
Full Transcript
ENGINEERING DATA ANALYSIS

INTRODUCTION TO DATA

TERMINOLOGIES:
1. Population (N)
- Collection of persons, things, or objects under study
2. Sample (n)
- Portion taken from the population
3. Sampling
- Selecting a portion (or subset) of the larger population and studying that portion (sample) to gain information about the population
4. Parameter
- Property of the population
5. Statistic
- Estimates a population parameter
- Number that represents a property of the sample

VARIABLES AND DATA
1. Data
- Actual values of the variable
- May be numbers or words
- Datum - single value
2. Variables
- Characteristic of interest for each person or thing in the population
  1. Numerical Value
  2. Categorical Value
A variable is the characteristic of the individual being observed or measured.

TYPES OF DATA
1. Qualitative
2. Quantitative
- Discrete
- Continuous

INFERENTIAL STATISTICS
- methods of making decisions or predictions about a population based on sample data
- drawing conclusions from good data

STATISTICAL INFERENCE
- uses probability to determine how confident we can be that our conclusions are true

PROBABILITY
- mathematical tool used to study randomness; deals with the chance of an event occurring
- quantitative description of chances associated with various outcomes

PROBABILITY CALCULATIONS
- used to analyze and interpret data
- provides a bridge between descriptive and inferential statistics

MEAN AND PROPORTION
1. Mean - average
2. Proportion - distribution

DESCRIPTIVE STATISTICS
- Numerical measures of data
- Organizing and summarizing data
2 ways to summarize:
1. Graphing (bar graph)
2. Using numbers (table)

QUALITATIVE (Categorical)
- are the result of categorizing or describing attributes of a population. It measures a quality or characteristic on each experimental unit. They are generally described by words or letters.
Examples:
Hair color (black, brown, blonde...)
Brand of cars (Dodge, Honda, Ford)
Gender (Male, Female)

QUANTITATIVE (Numerical)
- are always numbers. They are the result of counting or measuring attributes of a population.
Quantitative Discrete Data - are the result of counting
Quantitative Continuous Data - are the result of measuring
Example:
1. For each construction project, the number of laborers is measured. - Quantitative Discrete
2. Time until a light bulb burns out. - Quantitative Continuous
3. For a particular day, the number of cars entering a college campus is measured. - Quantitative Discrete
4. For a particular day, the amount of gas consumed by a karatig jeep. - Quantitative Continuous

ORGANIZING AND DISPLAYING DATA
1. Statistical Table
2. Graph

STATISTICAL TABLE:
Use a data distribution to describe:
- What values of the variable have been measured
- How often each value has occurred
"How often" can be measured in 3 ways:
- Frequency
- Relative Frequency = Frequency/n
- Percentage = Relative Frequency x 100

Colors    Tally   f    f/n           %
Blue      6       6    6/25 = 0.24   24%
Red       3       3    3/25 = 0.12   12%
Yellow    4       4    0.16          16%
Orange    5       5    0.20          20%
Green     4       4    0.16          16%
Brown     3       3    0.12          12%

GRAPHS
- More helpful in understanding the data
- No strict rules concerning which graphs to use
Two graphs that are used to display data:
1. pie charts
2. bar graphs

PIE CHART
- Categories are represented by wedges in a circle and are proportional in size to the percent of individuals in each category

BAR GRAPH
- The length of the bar for each category is proportional to the number or percent of individuals in each category.
- Bars may be vertical or horizontal

POPULATION VS. SAMPLE
Population:
μ = mean
σ = standard deviation
The average weight of all the students in BulSU Laboratory High School: μ = 61 kg
Sample:
X̄ = mean
s = standard deviation
The average weight of the students in Grade 7 of BulSU Laboratory High School: X̄ = 61.5 kg

CENSUS
- information/data gathered from every member of the population

DATA
- information from the sample of the population

SAMPLING PLAN/METHOD
- selecting the group from which the researcher will collect data

SAMPLING METHODS
1. Non-Probability Sampling
- Convenience Sampling
- Judgemental Sampling
- Quota Sampling
- Snowball Sampling
2. Probability Sampling
- Simple Random Sampling
- Systematic Random Sampling
- Cluster Sampling
- Stratified Sampling

NON-PROBABILITY SAMPLING
- individuals of the population are not given an equal opportunity of becoming a part of the sample.
1. Convenience Sampling
- Choosing samples based on easy or convenient access
2. Quota Sampling
- Choosing samples to fill a specific quota
- They are chosen according to traits or qualities
3. Judgemental Sampling
- Also called purposive sampling or authoritative sampling
- Sample members are chosen only on the basis of the researcher's knowledge and judgement
4. Snowball Sampling
- Members already chosen provide referrals to recruit the further samples required for a research study

PROBABILITY SAMPLING
- Members of the population have a pre-specified and equal chance to be part of the sample
1. Simple Random Sampling
- Choosing representatives by rolling a die, for instance, or using a number generator
2. Systematic Sampling
- Choosing representatives using a regular interval, say every "r-th" individual, to be a part of the sample.
3. Cluster Sampling
- Ideal for extremely large populations and/or populations distributed over a large geographic area.
4. Stratified Sampling
- Choosing members of a sample when there are clearly defined subgroups in the population you are studying.
- Formula:
# of members from each stratum = (# of members in the stratum / Total # of members in population) x (Desired sample size)

DESIGNING A SURVEY
When designing a survey, the following steps are useful.
1. Determine the goal of your survey: What problem do you want to have an answer? What variables do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-administered paper survey, or internet survey.
4. Decide what questions you will ask, in what order, and how to phrase them. (This is important if there is more than one piece of information you are looking for.)
5. Conduct the interview and collect the information.

A university campus director wants to construct a survey that shows how many hours per week an Engineering student works at the university. List the goal of the survey. What population sample will he interview? How would he administer the survey?

CONDUCTING A SURVEY
1. Face-to-face interviews
Advantages:
- Fewer misunderstandings
- High response rate
- Additional information can be collected from respondents
Disadvantages:
- Time-consuming
- Expensive
- Can be biased based upon the attitude or appearance of the surveyor
2. Self-administered Survey
Advantages:
- Respondent can complete it on his or her free time
- Less expensive than face-to-face interviews
- Anonymity causes more honest results
Disadvantage:
- Lower response rate

LEVELS OF MEASUREMENT
Data can be classified into four levels of measurement.
1. Nominal scale level
2. Ordinal scale level
3. Interval scale level
4. Ratio scale level

NOMINAL SCALE LEVEL
- Qualitative/Categorical
- Names, colors, labels, gender, etc.
- Order does not matter
- Cannot perform mathematical computations; only frequencies and proportions can be applied
- Can be represented through bar chart and pie chart

10 individuals are interviewed about their favorite color.

Color   Numerical Description   Frequency   Percentage
Red     1                       2           20%
Blue    2                       3           30%
Green   3                       5           50%

ORDINAL SCALE LEVEL
- Attributes can be rank ordered
- Ranking and placement
- Ordered attributes. The order matters
- Can perform mathematical computations: frequencies and proportions, sometimes means
- Differences cannot be measured
- Can be represented through bar chart.

INTERVAL SCALE LEVEL
- The order matters
- Differences can be measured.
- Zero is arbitrary.
Arbitrary - depending upon choice or discretion

RATIO SCALE LEVEL
- Order matters
- Differences are measurable (including ratios)
- Contains a "0" starting point

COLLECTION OF DATA
Collection of data
- is the first step in the field of research.
Presentation of data
- the process to condense and arrange the data and to study their characteristics once the collection process is complete
Ungrouped data
- data in its original form which the researcher first collects from research.
- Ungrouped data or raw data is a mere list of numbers that does not convey anything on its own, because no summarization or aggregation has been applied yet.
Grouped data
- refers to the data which is bundled together in different classes or categories.

Given the data, determine the ff:
a. Frequency table
b. Relative frequency
c. Cumulative frequency
d. Cumulative relative frequency
3, 2, 4, 5, 6, 8, 2, 5, 8, 7, 9, 8, 8, 8, 11, 10, 12, 11, 9

Steps in constructing a Frequency Distribution
1. Determine the largest and smallest value in the data
2. Determine the number of class intervals (k) desired
Recommended values from Joan and Gyms
Sturges offers a mathematical formula: k = 1 + 3.322 log10(n)
Or, as mentioned before, you may use k = √n and then round to the nearest whole number, if necessary
3. Determine the approximate class size (C) [class size is also known as bin size or class width]
4. Determine the lower and upper limits of the class interval
5. Write down the class intervals, starting with the decided lower- and upper-class limits of the first class interval. Add the class size to the lower- and upper-class limits to obtain the next class interval, and so on.
6. Determine the number of observations falling under each class interval.

Class Boundaries
- ... the class limit. This is necessary so that NO values can be observed exactly on a boundary

Class mark or midpoint, xi
- When only grouped data is available, the individual data values are not known (only intervals and interval frequencies are known); therefore, there is no way to compute an exact mean, median and mode for the data set. We simply need to modify the definition to fit within the restrictions of a frequency table
- Since we do not know the individual data values, we can instead find the midpoint (xi) or class mark of each interval

Relative Frequency Histograms
- for a quantitative data set, a bar graph in which the height of the bar shows "how often" (measured as proportion or relative frequency) measurements fall in a particular class or subinterval

STATISTICS IN DATA ANALYSIS
Why study statistics in Data Analysis?
It is used for making generalizations, predictions and decisions.
A garment factory made a survey about the color of T-shirt that people like to wear. The information is listed below.

COLOR    FREQUENCY
White    427
Red      300
Blue     405
Brown    310
Gray     450

Generalization: The most popular color of T-shirt is gray.
Prediction: Since the most popular color of T-shirt is gray, as a T-shirt maker you can predict that gray T-shirts are easy to sell.

STATISTICAL MEASURES
Measures of Center
- Mean
- Median
- Mode
Measures of Spread
- Range
- Average Deviation
- Variance
- Standard Deviation
Measures of Symmetry
- Skewness
Measures of Position
- Percentile, Quartile, Decile
- Z-score

MEASURES OF THE CENTER OF THE DATA (UNGROUPED DATA)
Measure of central tendency or measure of central location is a single number that gives a summary of the characteristics of a given set of data. The most commonly used measures of central tendency are:
- mean (arithmetic average)
- median
- mode
Such measures can be computed in two ways: ungrouped data form and grouped data form.

ARITHMETIC MEAN (mean or average) is the average of the measurements; described as the center of gravity.
Characteristics:
a. It always exists. It can be calculated for any set of numerical data.
b. It is unique. A set of numerical data has only one mean.
c. It can be combined with other data sets. It lends itself to further statistical manipulation.
d. It is reliable for inference making. The mean of many samples drawn from the same population generally does not vary or fluctuate much.
e. It takes into account every data point. It may be affected by extreme values.

THE LAW OF LARGE NUMBERS AND THE MEAN
The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean x̄ of the sample is very likely to get closer and closer to the population mean μ.

MEDIAN is a measure of central tendency which divides the data into two equal parts. It corresponds to the value of the middle item (or the average of the values of the two middle items) when the data are arranged in an increasing or decreasing order of magnitude.
Characteristics:
a. It always exists.
b. It is unique.
c. It is not easily affected by extreme values.

MODE is a measure of central tendency which occurs as the most frequently observed value of the variable. It is the value that occurs with the highest frequency.
Characteristics:
a. It requires no calculations (in case of ungrouped data)
b. It is applicable to both qualitative and quantitative data.
c. It may not exist. This happens if all the values are observed with the same frequency.
d. If it exists, it may not be unique.

1. A meeting is attended by four Executives aged 43, 41, 45 and 41 and the owner/President of the company, who is 60 years old. The secretary noted that the average age of those attending the meeting is 46 yrs old. Verify if the statement of the secretary is correct.
2. Suppose the Vice President of the company, who is 45 yrs old, attended the meeting. What would be the mean and median age of the attendees?
3. Mode

COMPARING THE MEAN, MEDIAN AND MODE
Set 1: Recall the ages of those who initially attended the meeting.
Set 2: Suppose the age of the President is not included but the age of the Vice President is included.
Set 3: Assume that instead of the President, his son, who is 20 years old, presided over the meeting as a training requirement to handle the company in the future.

Suppose brand D is out of stock. Which brand of battery will you buy?

MEASURES OF POSITION
QUARTILES: Divide the data into 4 equal parts
PERCENTILES: Divide the data into 100 equal parts
DECILES: Divide the data into 10 equal parts

PERCENTAGE VS PERCENTILE
1, 2, 3, 4, 5
Percentage = (no. meeting the characteristic of interest / Total no. of data) x 100
PERCENTILE is a value below which a certain percentage of observations lie

MEASURES OF SPREAD (UNGROUPED DATA)
Range
- in statistics, it is the spread of data from the lowest to the highest value in the distribution. It is the simplest measure of variability.
Average Deviation
- a statistical tool that provides the average of the different variations in a data set. The purpose is to measure the distance of each value from the data set's mean or median.
Variance
Standard Deviation
- is the measure of the amount of variation or dispersion of a set of values.
- A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
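
The three ways of measuring "how often" from the statistical table section (frequency, relative frequency, percentage) can be sketched in Python using the color tally from the reviewer. The function name `frequency_table` is only an illustrative choice, not something from the source.

```python
# Frequencies (f) copied from the color table in the reviewer; n = 25 in total.
counts = {"Blue": 6, "Red": 3, "Yellow": 4, "Orange": 5, "Green": 4, "Brown": 3}


def frequency_table(freqs):
    """Return rows of (category, f, f/n, percentage) for a dict of frequencies."""
    n = sum(freqs.values())
    # Relative Frequency = Frequency / n; Percentage = Relative Frequency x 100
    return [(cat, f, f / n, 100 * f / n) for cat, f in freqs.items()]


table = frequency_table(counts)
for cat, f, rel, pct in table:
    print(f"{cat:<8}{f:>3}{rel:>7.2f}{pct:>6.0f}%")
```

The relative frequencies should add up to 1 and the percentages to 100, which is a quick check that no category was missed in the tally.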
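
The stratified sampling formula (members in the stratum / total members in the population, times the desired sample size) can be illustrated with a short sketch. The strata names and sizes below are hypothetical, chosen only for illustration, and rounding to whole persons is an assumption the reviewer does not state.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Number of members to draw from each stratum, proportional to its size."""
    total = sum(strata_sizes.values())
    # (# in stratum / total # in population) x desired sample size, rounded
    return {name: round(size * sample_size / total)
            for name, size in strata_sizes.items()}


# Hypothetical population of 1000 students split into three strata.
strata = {"Grade 7": 200, "Grade 8": 300, "Grade 9": 500}
allocation = stratified_allocation(strata, sample_size=100)
print(allocation)
```

When the proportions do not divide evenly, the rounded counts may not add up exactly to the desired sample size, so the total is worth checking and adjusting by hand.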
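
The steps in constructing a frequency distribution can be tried on the 19 values given in the exercise. This is only a sketch under stated assumptions: k is taken as √n rounded, the class size is assumed to be C = (largest - smallest) / k rounded up (the reviewer's own formula for C did not survive in the text), and class limits are inclusive integers starting at the smallest value.

```python
import math

data = [3, 2, 4, 5, 6, 8, 2, 5, 8, 7, 9, 8, 8, 8, 11, 10, 12, 11, 9]


def frequency_distribution(values):
    """Rows of (lower, upper, f, relative f, cumulative f, cumulative relative f)."""
    n = len(values)
    low, high = min(values), max(values)            # step 1: smallest and largest
    k = round(math.sqrt(n))                         # step 2: k = sqrt(n), rounded
    c = max(1, math.ceil((high - low) / k))         # step 3: assumed class size
    rows, cumulative, lower = [], 0, low
    while lower <= high:                            # steps 4-5: build class limits
        upper = lower + c - 1                       # inclusive limit (integer data)
        f = sum(lower <= v <= upper for v in values)  # step 6: count observations
        cumulative += f
        rows.append((lower, upper, f, f / n, cumulative, cumulative / n))
        lower += c
    return rows


rows = frequency_distribution(data)
for lower, upper, f, rel, cum, cum_rel in rows:
    print(f"{lower:>2}-{upper:<3}{f:>3}{rel:>7.3f}{cum:>4}{cum_rel:>7.3f}")
```

Under these assumptions the data fall into four classes of width 3; a different choice of k (for example from Sturges' formula) would give a different but equally valid table.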
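
The meeting-age problems on the mean, median and mode can be checked with Python's standard `statistics` module. The variable names are illustrative; the ages are the ones given in the problems.

```python
import statistics

# Four Executives plus the owner/President (problem 1).
ages = [43, 41, 45, 41, 60]
mean_age = statistics.mean(ages)        # compare with the secretary's 46
median_age = statistics.median(ages)    # middle value after sorting
modes = statistics.multimode(ages)      # every value with the highest frequency

# The 45-year-old Vice President joins the meeting (problem 2).
ages_with_vp = ages + [45]
mean_vp = statistics.mean(ages_with_vp)
median_vp = statistics.median(ages_with_vp)   # average of the two middle values
modes_vp = statistics.multimode(ages_with_vp)

print(mean_age, median_age, modes)
print(round(mean_vp, 2), median_vp, modes_vp)
```

`multimode` is used rather than `mode` because, as the characteristics of the mode note, the mode may not be unique; once the Vice President is included, two ages are tied for the highest frequency.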
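
The measures of spread for ungrouped data can be sketched on the same five meeting ages. The reviewer does not say whether variance and standard deviation use the population (divide by N) or sample (divide by n - 1) form, so both are shown; taking the average deviation about the mean rather than the median is likewise an assumption.

```python
import statistics

ages = [43, 41, 45, 41, 60]

# Range: highest value minus lowest value.
data_range = max(ages) - min(ages)

# Average deviation: mean absolute distance of each value from the mean.
center = statistics.mean(ages)
average_deviation = sum(abs(x - center) for x in ages) / len(ages)

# Variance and standard deviation, population and sample versions.
population_variance = statistics.pvariance(ages)   # divides by N
sample_variance = statistics.variance(ages)        # divides by n - 1
population_sd = statistics.pstdev(ages)
sample_sd = statistics.stdev(ages)

print(data_range, average_deviation)
print(population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```

The President's age of 60 is the extreme value here: it accounts for most of the squared deviation, which shows how strongly variance and standard deviation react to a single outlier.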