Engineering Data Analysis PDF
Summary
This document provides an introduction to engineering data analysis. It covers definitions of key terms like statistics, variables, populations, and samples. It also explains descriptive and inferential statistics.
Full Transcript
ENGINEERING DATA ANALYSIS

The field of data analysis can also be called engineering analytics or applied mathematical statistics.

Engineering Data Analysis – involves basic statistical techniques, probability, risk analysis, and predictive modeling, and how they impact engineering and manufacturing activities in both analytical and forward-looking activities (the latter refers to projections, forecasting or estimations).

DEFINITION OF TERMS:

1) Statistics – assumes a dual meaning:
   a. In singular form, statistics is a science which deals with the collection, organization, presentation, analysis and interpretation of data, and with drawing inferences from them.
   b. In plural form, statistics refer to measures or data obtained from studying a population or sample.
2) Statistical Variable – a characteristic, attribute or property which tends to vary from one element to another in a population or sample and which can be measured in terms of categories, names or numerical values. Simply put, it refers to a characteristic or property of measurable quantities.
3) Population – any complete set of people, objects or observations (events) being studied and having a common characteristic.
4) Sample – a representative portion, subset or subgroup of a population.
5) Statistical Data or statistics – refers to the non-numeric (names, categories, codes) or numerical values of a statistical variable.
6) Data Set / Data File – a collection of data values.
7) Data value, datum or element – a specific value in a data set.
8) Probability – refers to the chance of occurrence of an event. It is a tool used in inferential statistics.
9) Hypothesis – a tentative statement used as an explanation of an observed event; a conjecture about a population parameter. This conjecture may or may not be true and must be tested.
10) Hypothesis-Testing – a decision-making process for evaluating claims about a population.

BRANCHES OR KINDS OF STATISTICS OR STATISTICAL STUDY

A. Descriptive Statistics – involves describing a situation for a given population or sample.
   - consists of the collection, organization, summarization and presentation of data.
   - describes the behavior of the sample or population.
B. Inferential Statistics – involves making inferences, generalizations or predictions from samples to populations.
   - consists of generalizing from samples to populations, performing hypothesis-testing, determining relationships among variables, and making predictions or decisions.
   - uses principles of probability to arrive at inferences, generalizations or predictions from sample to population.

CLASSIFICATIONS OF STATISTICAL VARIABLES (REFERS TO CHARACTERISTICS BEING STUDIED/MEASURED)

I. According to the data obtained when measured:

1.) Qualitative Variables – assume distinct categories, according to some characteristic or attribute.
   - variables which assume non-numeric data.
   Ex: religious affiliation, gender, educational attainment, attitude scale or Likert scale.
2.) Quantitative Variables – assume numerical values and can be ordered or ranked.
   May be further classified according to their nature:
   a.) Discrete Quantitative Variables – assume values that are counted. These variables are countable and cannot assume all values between any two specific values.
       Ex: no. of children in the family, no. of calls per day, population in a locality.
   b.) Continuous Quantitative Variables – can assume all values between any two specific values. They are obtained by measurement.
       Ex: temperature, gasoline mileage, family income.

II. According to the measurement used to obtain its data:

Levels or Scales of Measurement – refer to how the data of a variable are categorized, counted or measured.

LEVELS OR SCALES OF MEASUREMENT OF STATISTICAL VARIABLES OR DATA:

1.) Nominal Scale/Level – classifies data into names or categories that are mutually exclusive and exhaustive but NON-RANKABLE.
   - Ex: gender, nationality, plate number, civil status.
2.) Ordinal Level – classifies data into names or categories that are rankable, although precise differences between ranks may not exist.
   - Ex: educational attainment, attitude scale (Likert scale), ranking system.
3.) Interval Level – classifies data into numerical values for which precise units of measure exist, but which has no true zero (a value of zero does not mean the absence of the quantity measured).
   - Ex: grade point average, temperature (in °C or °F), IQ.
4.) Ratio Interval Level or Ratio Level – classifies data into numerical values with the same characteristics as the interval level, except that a true zero exists and has meaning for the given variable.
   - Ex: no. of children, monthly income, gasoline consumption.

WHY IS LEVEL OF MEASUREMENT IMPORTANT?

1.) The level of measurement helps one decide how to interpret the data for a given variable. When you know that a measure is nominal, you know that any numerical values are just short codes for the longer names.
2.) Knowing the level of measurement helps one decide what statistical analysis is appropriate for the values that were assigned. If a measure is nominal, then you know that you would never average the data values or do a t-test on the data.

It is important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal).

CONDUCTING RESEARCH OR STATISTICAL STUDY OR SOLVING ENGINEERING OR TECHNICAL PROBLEMS

May be done on or applied to:
1) Population Study – if resources and time will allow.
2) Sample Study

SAMPLING PROCESS/PROCEDURE:
- is the process of selecting units (e.g., people, organizations, objects or events) from a population of interest to be included in the sample, so that by studying the sample we may fairly generalize our results back to the population from which they were chosen.

NOTE: A study is conducted on a sample rather than on the population only when the population has more elements than the researcher can feasibly study or cover.

KINDS OF SAMPLING PROCESS/PROCEDURE

1.) Probability Sampling or Random Sampling
   - every element in the population has a known, nonzero chance of being included in the sample (an equal chance, in simple random sampling).
   - applicable when a complete master list of all elements in the population is available.
2.) Non-Probability Sampling
   - the elements in the population do not all have a known chance of being included in the sample.
   - applicable when no master list of all elements in the population is available.

TYPES OF SAMPLING PROCESS/PROCEDURE

A. PROBABILITY OR RANDOM SAMPLING

1) Simple Lottery or Simple Random Sampling
   - without replacement
   - with replacement
   - random sampling using a table of random numbers
2) Systematic Sampling – each element in the population is numbered and a sampling interval, k, is chosen. Every kth element in the population is included in the sample, beginning with a random start.
3) Stratified Sampling – applicable when the population is classified into different groups or strata according to some characteristic that is important to the study. A proportional allocation of the number of elements in each stratum or group is included in the sample.
4) Cluster or Area Sampling – the elements in the sample are chosen from intact groups or clusters that are representative of the population.
5) Multistage Sampling – elements are chosen from various levels of categories in the population, such as provinces, regions and cities.

B. NON-PROBABILITY OR JUDGMENT SAMPLING

1) Accidental Sampling – sometimes known as grab, convenience or opportunity sampling.
   - a type of non-probability sampling in which the sample is drawn from the part of the population that is close to hand. That is, the sample is selected because it is readily available and convenient.
2) Quota Sampling – often used by market researchers. Interviewers are given a quota of subjects of a specified type to attempt to recruit. For example, an interviewer might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and 10 teenage boys so that they could interview them about their television viewing. Ideally the quotas chosen would proportionally represent the characteristics of the underlying population.
3) Purposive Sampling – also known as selective or subjective sampling.
   - choosing the sample according to criteria determined by the research objectives.
   - this technique relies on the judgement of the researcher when choosing whom to ask to participate. Researchers may thus implicitly choose a "representative" sample to suit their needs, or specifically approach individuals with certain characteristics. This approach is often used by the media when canvassing the public for opinions, and in qualitative research.
4) Panel Sampling – choice of 24 respondents.
   - randomly choosing a group of people to be part of a panel that takes part in a study several times over a period of time. For example, in a longitudinal survey, the same panel of people might be surveyed repeatedly over time.
5) Snowball Sampling – commonly used in the social sciences when investigating hard-to-reach groups. Existing subjects are asked to nominate further subjects known to them, so the sample increases in size like a rolling snowball. For example, when carrying out a survey of risk behaviour amongst intravenous drug users, participants may be asked to nominate other users to be interviewed.

Bias in sampling

There are five important potential sources of bias that should be considered when selecting a sample, irrespective of the method used. Sampling bias may be introduced when:
- Any pre-agreed sampling rules are deviated from
- People in hard-to-reach groups are omitted
- Selected individuals are replaced with others, for example if they are difficult to contact
- There are low response rates
- An out-of-date list is used as the sample frame (for example, if it excludes people who have recently moved to an area)

Topic Coverage
- Data Gathering/Collection Technique
- Data Management - Data Organization

Data Management

Data Management includes:
- Data Organization – involves meaningful organization of data into a frequency distribution or tabular form.
- Data Presentation – involves the use of statistical graphs and charts.

Two Types of Data File / Set:
- Raw Data or Ungrouped Data – refer to the data in the original form in which they were collected.
- Organized Data or Grouped Data – refer to data already systematically organized into a frequency distribution.

FREQUENCY DISTRIBUTION

Frequency Distribution – the organization of raw/ungrouped data in table form, using classes and frequencies.
Frequency – the number of values in a specific class of the distribution.

Types of Frequency Distribution
1) Categorical Frequency Distribution – applicable for nominal and ordinal level data.
2) Grouped Frequency Distribution – applicable for numeric data such as interval and ratio level data.
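As an illustration of a categorical frequency distribution, the sketch below tallies a small set of nominal data into classes and frequencies. The civil-status values and the function name are made up for the example; they are not from the lecture notes.

```python
from collections import Counter

# Hypothetical raw (ungrouped) nominal data: civil status of 12 respondents.
raw_data = ["single", "married", "single", "widowed", "married", "single",
            "married", "single", "separated", "married", "single", "married"]

def categorical_frequency_distribution(values):
    """Return rows of (category, frequency, percent), most frequent first."""
    counts = Counter(values)
    n = len(values)
    return [(category, f, round(100 * f / n, 1))
            for category, f in counts.most_common()]

table = categorical_frequency_distribution(raw_data)
for category, f, pct in table:
    print(f"{category:<10} {f:>3} {pct:>6.1f}%")
```

The percent column is the figure that would be used to size the wedges of a pie chart for the same data.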
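Going back to the probability sampling methods listed earlier, systematic sampling and stratified sampling with proportional allocation can be sketched in a few lines. This is only a minimal sketch under assumptions: the numbered master list, the three strata and all function names are hypothetical, and simple random sampling without replacement is assumed inside each stratum.

```python
import random

def systematic_sample(population, k, rng):
    """Pick every k-th element, beginning at a random start among the first k."""
    start = rng.randrange(k)          # random start: index 0 .. k-1
    return population[start::k]

def proportional_allocation(stratum_sizes, n):
    """Share a sample of size n among strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    # Rounding can make the shares add up to slightly more or less than n.
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}

def stratified_sample(strata, n, rng):
    """Draw a simple random sample (without replacement) inside each stratum."""
    sizes = {name: len(members) for name, members in strata.items()}
    shares = proportional_allocation(sizes, n)
    return {name: rng.sample(members, shares[name])
            for name, members in strata.items()}

rng = random.Random(42)               # fixed seed so the run is repeatable

# Hypothetical master list: elements numbered 1..100, sampling interval k = 10.
population = list(range(1, 101))
every_tenth = systematic_sample(population, k=10, rng=rng)

# Hypothetical strata of sizes 50, 30 and 20; total sample size n = 20.
strata = {
    "civil": list(range(1, 51)),
    "mechanical": list(range(51, 81)),
    "electrical": list(range(81, 101)),
}
chosen = stratified_sample(strata, n=20, rng=rng)

print(every_tenth)
print({name: len(members) for name, members in chosen.items()})
```

Both functions need a complete numbered list of the population, which is why the notes restrict probability sampling to cases where a master list is available.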
Methods of Data Gathering/Collection:

1) Asking Questions
   - Direct / interview form
   - Indirect / questionnaire form (survey)
2) Observation
3) Use of existing data
4) Experimentation
5) Simulation

Types of Questionnaires:
- Unstructured – the questions are asked in no particular order or arrangement, as long as all those needed to answer the questions posed in the study are asked.
- Structured – the questions are arranged according to the order of the statement of the problem.
- Open-ended – those which can be answered in any form and length.
- Close-ended – those for which the researcher provides a number of possible responses to choose from.
   - may include alternate-response questions, multiple-choice questions, scaled opinionnaires, attitude scales using the Likert scale, etc.

NOTE: The questions commonly asked for descriptive statistics concern demographic data.

The Reasons for Constructing a Frequency Distribution

1) To organize the data in a meaningful, intelligible way.
2) To enable the reader to determine the nature or shape of the distribution.
3) To facilitate computational procedures for measures of average (central tendency) and spread (dispersion) of the data set.
4) To enable the researcher to draw charts and graphs for data presentation.
5) To enable the reader to make comparisons among different data sets.

NOTE: The DISTRIBUTION is a summary of the frequency of individual values or ranges of values for a variable. This is the main concern of performing data organization.

For Grouped Frequency Distribution

Features of Grouped Frequency Distribution:

1) Class or class interval – a specific range of values whose frequency is obtained.
2) Class limits – the lowest and highest values that can be included in a given class:
   a) Lower class limit (LL) – lowest value in a given class.
   b) Upper class limit (UL) – highest value in a given class.
3) Range (R) – the difference between the highest and lowest values in the data file/set.
   Where: R = highest value - lowest value in the raw data
4) Class size/width/length (C) – number of units of numeric value in a given class.
   Where: C = UL - LL + 1
   Or: C = Range / desired number of classes
5) Class boundaries – the numerical values that separate the classes so that there are no gaps in the frequency distribution. (Basic rule: the class limits should have the same decimal place value as the data, but the class boundaries should have one additional place value and end in 5. For example, with whole-number data the class 10-14 has boundaries 9.5 and 14.5.)
6) Class Midpoint (M) – middlemost value in a given class.
   Where: M = (UL + LL) / 2

Guidelines for classes

1) There should be between 5 and 20 classes.
2) The class width should be an odd number. This will guarantee that the class midpoints are integers instead of decimals.
3) The classes must be mutually exclusive. This means that no data value can fall into two different classes.
4) The classes must be all-inclusive or exhaustive. This means that all data values must be included.
5) The classes must be continuous. There are no gaps in a frequency distribution. A class that has no values in it must still be included, unless it is the first or last class, in which case it is dropped.
6) The classes must be equal in width. The exception here is the first or last class. It is possible to have a "below ..." or "... and above" class. This is often used with ages.

DATA PRESENTATION

Data Presentation: usually done by representing the data as graphs or charts, that is, by converting the frequency distribution or tabular form into statistical graphs or charts.

Types of Graphical Presentation of Frequency Distribution

1) Pie Chart – a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
   - applicable for categorical frequency distributions, that is, for the nominal and ordinal levels.
2) Histogram or Bar Graph – the frequency (Y-axis) is plotted against the class boundaries (X-axis) for each class interval.
3) Frequency Polygon – the frequency (Y-axis) is plotted against the midpoint (X-axis) of each class.
4) Ogive – a graph that represents the cumulative frequencies for the classes in a frequency distribution.

Data Summarization / Description / Data Analysis

Summarization of Statistical Data
- a method of describing the characteristics of a sample or population based on the data obtained for a given statistical variable.

Ways of Summarizing or Describing Data:

1. Measures of Central Tendency or Measures of Averages
   - describe the center of the distribution of the gathered data.
2. Measures of Variation or Measures of Dispersion
   - describe how the data are spread out or scattered away from the center of the distribution.
3. Measures of Position or Measures of Rank
   - describe where a specific data value falls within the data set, or its relative position in comparison with the other data values.

Terminologies Used

NOTE: Measures found by using all the data values in the population are called "parameters", while measures obtained by using the data values of samples are called "statistics".

The Measures of Central Tendency or Measures of Averages:

The central tendency of a distribution is an estimate of the "center" of a distribution of values. These measures are single values that describe the center of the whole set of data points. There are three major types of estimates of central tendency:
- Mean
- Median
- Mode

1.) Mean or arithmetic average or computational average
   o the most common measure of central tendency.
   o most sensitive to the data distribution.
   o affected by extremely high or low scores or values in the data (skewed data).
   o computed by using all values in the data set.
   o assumes a unique value for each data set.
   o usually appropriate in describing the center of a data set with a normal or bell-shaped distribution.
   o applicable only to variables measured using the ratio or interval scale, that is, numerical values.

The Mean is appropriate for use in the following situations:
- When the distribution consists of ratio or interval data which have no extreme values (too high or too low in comparison with the other values in the data set).
- When other statistics or parameters (like the standard deviation, coefficient of correlation, etc.) are subsequently to be computed.
- When the distribution is normal or is not greatly skewed, the mean is usually preferred to either the median or the mode. In such cases, it provides a better estimate of the corresponding population parameter than either the median or the mode.

2.) Median or the counting average
   o the middlemost value in a given data set.
   o the value in the data set that separates the upper 50% and the lower 50% of the data.
   o commonly used to represent the center or middle value of the data set.
   o not sensitive to the presence of extremely high or low values in the data set.
   o used when one must determine whether a given value falls into the upper half or the lower half of the distribution.
   o appropriate for use in both normal and skewed distributions of data.
   o applicable to the ordinal scale of measurement and to numeric values of variables.
   o also used to denote a measure of position or rank.

The Median is appropriate to use:
When the distribution is grossly asymmetrical or skewed, or when the series contains a few extremely high or a few extremely low scores compared with the rest of the scores, the median is the most representative average. This is because the median depends only on the position of the scores in the ordered data, so the magnitudes of the extreme scores have nothing to do with its computation.

Determination of the Median (Med)

Case 1: Using Ungrouped Data (Raw Data)
Steps:
a.) Arrange the data in an array (in order of magnitude).
b.) The middlemost value is the median. For an even number of values, the median is the mean of the two middle values.

Case 2: Using Grouped Data (Frequency Distribution)
Steps involved:
1. Obtain the "less than" cumulative frequency for each class.
2. Determine the median class, which is the first class whose "less than" cumulative frequency reaches or exceeds n/2 or ½ (Σf).

3.) Mode or Inspectional Average
   o the most common value in the data set, or the value/category in the data set with the highest frequency.
   o also considered as the "most typical value" in the data set.
   o applicable to variables in all levels or scales of measurement (nominal, ordinal, interval and ratio).
   o the mode of a data file is not always unique; a data file may have one or several modes, or none at all.
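To see how the three averages react to an extreme value, the sketch below computes them with Python's standard `statistics` module before and after one outlier is appended. The scores are hypothetical and the helper name `summarize` is chosen only for this example.

```python
import statistics

scores = [70, 72, 75, 75, 78, 80, 82]     # hypothetical raw (ungrouped) data
with_outlier = scores + [200]             # same data plus one extreme value

def summarize(values):
    """Return the three measures of central tendency for raw data."""
    return {
        "mean": statistics.mean(values),        # uses every value
        "median": statistics.median(values),    # middle of the ordered array
        "modes": statistics.multimode(values),  # list, since the mode may not be unique
    }

before = summarize(scores)
after = summarize(with_outlier)
print(before)
print(after)
```

With these numbers the mean moves from 76 to 91.5, the median only from 75 to 76.5, and the mode stays at 75, which matches the remark that the mean is the most sensitive to extreme values. Note that `statistics.multimode` returns every value when all values occur equally often, whereas the notes describe that case as having no mode.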
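The features of a grouped frequency distribution described earlier (range, class width, class limits, class boundaries, midpoints and "less than" cumulative frequencies) can be put together as in the following sketch. It assumes whole-number data, so the boundaries lie 0.5 beyond the limits. The data values, the starting lower limit of 50 and the function names are all hypothetical.

```python
import math

def grouped_frequency_distribution(data, desired_classes, first_lower_limit):
    """Build the classes of a grouped frequency distribution for whole-number data.

    first_lower_limit must not exceed the lowest data value.
    """
    lowest, highest = min(data), max(data)
    data_range = highest - lowest                       # R = highest - lowest
    width = math.ceil(data_range / desired_classes)     # C = R / desired number of classes
    if width % 2 == 0:
        width += 1                                      # odd width gives whole-number midpoints
    rows, cumulative, lower = [], 0, first_lower_limit
    while lower <= highest:
        upper = lower + width - 1                       # so that C = UL - LL + 1
        frequency = sum(1 for x in data if lower <= x <= upper)
        cumulative += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),   # one extra decimal place, ending in 5
            "midpoint": (lower + upper) / 2,            # M = (UL + LL) / 2
            "f": frequency,
            "cum_f": cumulative,                        # "less than" cumulative frequency
        })
        lower = upper + 1
    return rows

def median_class(rows):
    """Return the first class whose cumulative frequency reaches n/2."""
    n = rows[-1]["cum_f"]
    return next(row for row in rows if row["cum_f"] >= n / 2)

# Hypothetical raw data: 20 whole-number measurements.
data = [52, 55, 58, 60, 61, 63, 64, 66, 67, 67,
        68, 70, 71, 72, 73, 75, 77, 78, 81, 84]

rows = grouped_frequency_distribution(data, desired_classes=7, first_lower_limit=50)
for row in rows:
    print(row)
print("median class:", median_class(rows)["limits"])
```

The number of classes actually produced depends on the rounded width and the chosen starting limit, so it can differ from the desired number; it should still be checked against the guideline of 5 to 20 classes.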