Course Information and Introduction PDF
Document Details
P. Nyenje (MAK)
Summary
This document introduces the fundamental concepts of statistics, including descriptive and inferential statistics, and their roles in engineering research. It provides case studies and examples, such as the effect of additives on concrete strength.
Full Transcript
1.0 Introduction

Statistics is the study of methods for: (1) organizing and summarizing data, and (2) inferring conclusions about a population on the basis of a data sample. The practice of statistics uses data from some population in order to describe it meaningfully, to draw conclusions from it, and to make informed decisions.

There are 2 branches of statistics:
1. Descriptive statistics: involves organizing, summarizing and describing data sets.
2. Inferential statistics: the process of drawing conclusions about the entire population based on results obtained from a small sample of that population, e.g. probability distributions, hypothesis testing, estimation theory, regression analysis.

1.1 Role of Statistics in Engineering Research

Statistics is an important tool used by researchers in many fields of research to organize, analyze and summarize data to draw meaningful information. It is helpful to put statistics in the context of a general process of engineering research:
1. Identify a question or problem.
2. Collect relevant data on the topic.
3. Analyze the data - describe it and make inferences (this is where statistics comes in).
4. Form a conclusion.

Any engineering research begins with a research question about the underlying processes in a system, such as:
o What sulfate concentration might one expect in rainfall?
o How variable is hydraulic conductivity in a given aquifer?
o What is the 100-year flood in River Rwizi?
o Is non-revenue water in the Kampala water supply due to poor management?
o What is the effect on the strength of concrete when recycled concrete is used in the mix?

To understand these processes, researchers collect data (through field observations or through experiments/trials), analyze the data to draw conclusions, and relate the findings. However, most natural systems are complex and evolve in time and space, implying that the processes describing such systems are uncertain or stochastic. A stochastic process means that the data describing the process have a random variability and may not be predictable (deterministic). Statistics plays an important role in understanding the underlying random variability or uncertainty in datasets describing a process and in drawing reliable conclusions about the underlying processes or phenomena by:
o Describing and summarizing the data.
o Making inferences about the population from a sample. This requires collecting large sets of data for a given research problem and drawing conclusions from these data.
Statistical analyses are also often used to communicate research findings and to support hypotheses, and they give credibility to research methodology and conclusions.

Case study 1: Concrete additives

This case study introduces a classic challenge in statistics: evaluating the effect of an additive on the strength of concrete. Additives are usually added to the water-cement mixture to increase the life, strength and hardening of concrete. The principal question would be: Does adding additive X increase the strength of concrete? Alternatively, the researcher formulates the hypothesis: adding additive X increases the strength of concrete.

The researcher carried out an experiment in which 25 concrete batches were made. Each of the 25 batches was divided into halves and additive X was added to one half of the batch; nothing was added to the other half. The resulting compressive strengths measured for the concrete batches can be presented in a data matrix:

Batch No.    1    2    3    4    5    6    7    8   ...   25
Treated      (compressive strengths of the halves with additive X)
Untreated    (compressive strengths of the halves without additive X)

We can compute summary statistics for each group and use them to see differences between the groups. From this table we can quickly see whether treated samples are stronger than untreated samples using descriptive statistics:

             Mean         Standard deviation
Treated      Average 1    Std deviation 1
Untreated    Average 2    Std deviation 2

We can then use inferential statistics to generalize the findings to the entire population (i.e. we infer the results). For example, how significantly different are the means of the two groups of datasets?
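To illustrate how the two branches of statistics work together in this case study, the following short Python sketch computes descriptive statistics for each group and then applies a paired t-test, one possible inferential procedure, since the two halves of each batch form natural pairs. The strength values are invented for illustration; the actual measurements from the 25 batches are not given in these notes.

import numpy as np
from scipy import stats

# Hypothetical compressive strengths (MPa) for a few of the batch halves;
# the real experimental values are not reported in these notes.
treated   = np.array([34.1, 36.2, 33.8, 35.5, 34.9, 36.0, 35.2, 34.4])
untreated = np.array([32.9, 34.8, 33.1, 34.0, 33.5, 34.6, 33.9, 33.2])

# Descriptive statistics: summarize each group.
print("Treated:   mean = %.2f, sd = %.2f" % (treated.mean(), treated.std(ddof=1)))
print("Untreated: mean = %.2f, sd = %.2f" % (untreated.mean(), untreated.std(ddof=1)))

# Inferential statistics: the halves come from the same batches, so the
# observations are paired; a paired t-test asks whether the mean difference
# between treated and untreated halves could plausibly be zero.
result = stats.ttest_rel(treated, untreated)
print("paired t = %.2f, p-value = %.4f" % (result.statistic, result.pvalue))

A small p-value would lead us to infer that additive X increases the strength of concrete in the wider population of batches, not just in the sample tested.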
1.2 Misuse and abuse of statistics

Statistical tools may be misused either accidentally, due to lack of knowledge, or intentionally, to achieve desired results. Such misuse may completely invalidate a research study. Abuses may occur in various ways, but the most common in engineering are the incorrect application of statistical tests and the lack of appropriate treatment of data before analysis. The latter abuse is the most common and is related to characteristics of the data, which should be recognized before analysis. These mainly include:
a. A lower bound of zero where no negative values are possible.
b. Presence of outliers (observations considerably higher or lower than most of the data). Most outliers occur on the high side. An analysis based on average values can be distorted by a single extreme outlier.
c. Skewness in the dataset. Positive skewness is expected in datasets that have high outliers; when the frequency of the data is plotted, the distribution is skewed to the right. Here the median yields more information about the dataset than the mean (see the sketch after this list).
d. Non-normal distribution of the data, due to (a), (b) and (c) above. Many statistical tests assume that data follow a normal distribution, yet, for example, most hydrological data are not expected to follow a normal distribution.
e. Data reported only as below or above some threshold. An example is a drought reported only as either severe or mild.
f. Seasonal patterns: values tend to be higher or lower in certain seasons of the year.
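To make points (b) and (c) concrete, the short sketch below (with made-up, positively skewed values) shows how a single high outlier pulls the mean upward while leaving the median almost unchanged, which is why the median is often the more informative summary for such data.

import numpy as np

# Illustrative positively skewed sample with one extreme high outlier.
values = np.array([3.2, 4.1, 3.8, 4.5, 3.9, 4.2, 3.7, 58.0])

print("mean   =", np.mean(values))    # pulled up strongly by the single outlier
print("median =", np.median(values))  # barely affected; more representative here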
1.3 Statistical terminologies and Data basics

a. Population (N): the whole collection of values under consideration. It may be finite or infinite.
b. Sample (n): the subset of data that represents the population.
c. Variable: a physical entity whose value can vary, e.g. depth of rainfall, infiltration, evaporation, strength of concrete.
d. Random variable: a variable whose value is subject to variation due to chance. It can take on any of a set of possible different values.
e. Categorical variable: a variable that is qualitative in nature, e.g. ordinal and nominal variables. Categorical variables take on a limited number of distinct categories. Categories can be labels or names; they can also be numbers, but it is not possible to perform arithmetic operations on them.
f. Numerical variable: a variable that can take on a wide range of numerical values and on which arithmetic operations are possible. Numerical variables can be either discrete or continuous.
g. Continuous variable: a variable whose values can be measured along a continuous scale, e.g. a discharge hydrograph.
h. Discrete variable: a variable whose values are countable, e.g. the number of rain days in a week or the number of floods in a specific period.
i. Univariate or multivariate variables: a variable is called univariate if the observation is of a single quantity. Multivariate (vector-valued) variables have observations of more than one quantity (e.g. EC, pH, Na, Mg): X = [X1, X2, ...].
j. Data: the observations in the sample. Data form the basis of statistical investigations. Data may be either experimental or historical. Experimental data are measured through experiments and can usually be obtained repeatedly. Historical data, however, are collected from natural phenomena that can be observed only once. Most hydrological data are historical and consist of a set of values of a variable, say the maximum seasonal floods observed during 30 years.
k. Series: a sequence of values arranged in order of their occurrence in time or in space. A continuous series is one where observations are recorded continuously in time or space. Most observations collected in hydrology are in the form of time series (e.g. a plot of discharge against time).
l. Independent and dependent variables: an independent variable is a variable that influences the outcome measure. A dependent variable is a variable that depends on, or is influenced by, the independent variable(s); it may also be the variable you are trying to predict. E.g. a researcher wants to study the effect of climate change on river runoff: the independent variable is climate change and the dependent variable is river runoff. Note also the distinction between datasets: dependent (paired) datasets consist of variables that are related in some way, so that there is an association between one variable and the other, whereas independent datasets are not related in this way.
m. Explanatory and response variables: an explanatory variable is a variable suspected to affect another variable; it is a type of independent variable. E.g. if we suspect that the quality of water affects concrete strength, then water quality is an explanatory variable and concrete strength is the response. It is possible to have many explanatory variables. Note: association does not mean causation. Two variables may be associated or correlated without a causal relationship. A causal relation occurs when the occurrence of one variable affects the other (usually determined from experiments in which other factors are kept constant). When the data are observational, only an association can be inferred.

1.4 Levels of data measurement

Before conducting a statistical analysis, it is important to consider how the variable is measured. The way the dependent variable is measured depends on the type of variable. For example, for time you need a stopwatch; for performance you might get responses such as poor, fair or good; for type of pit latrine you might get responses such as ordinary pit latrine, VIP or traditional pit latrine. The values or responses of a variable can therefore be numeric or qualitative. There are four levels of data measurement.

(a) Nominal level
Here you simply categorize the responses of the variable being researched, e.g. type of toilet, type of weather station, gender. There is no ordering of the responses, only categorization; e.g. gender is either male or female and the order of these categories does not matter. This is the lowest level of measurement.

(b) Ordinal level
Here the responses about a variable are ordered. E.g. for performance, one can respond poor, fair, good, very good or excellent, so the order matters: from poor to excellent is a measure of increasing performance. Differences between responses, however, are not necessarily the same; e.g. the difference between poor and fair may not have the same meaning as the difference between very good and excellent. Other examples are percentile ranks.

(c) Interval level
The responses of the variable are numerical values in which equal intervals have the same meaning. E.g. the difference between 4 degrees centigrade and 6 degrees has the same meaning as the difference between 20 degrees and 22 degrees: the 2 degrees centigrade has the same physical meaning in terms of temperature. Interval data do not have a true zero point; e.g. a temperature of 0 degrees does not mean that there is no temperature. Temperature exists, but it is zero.

(d) Ratio scale
This is the most informative scale. It is the same as an interval scale, but here a value of zero indicates absence of the variable. E.g. 0 mg/l of Ca in water means that there is no calcium in the water; a water level of zero means there is no water at all.

Example: a 5-point grading scale is used for the performance of pit latrines in Kampala (poor, fair, good, very good, excellent). What is the level of data measurement? Ordinal.
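As an illustration of the difference between nominal and ordinal measurement (an addition for this transcript, using pandas categorical types as one possible representation), the sketch below shows that an ordinal scale supports ordering comparisons but not arithmetic, while a nominal scale supports neither.

import pandas as pd

# Nominal: categories have no meaningful order (e.g. type of toilet).
toilet_type = pd.Categorical(["VIP", "ordinary", "traditional", "VIP"])

# Ordinal: categories carry an order, so ranking comparisons are meaningful,
# but differences between levels are not defined.
performance = pd.Categorical(
    ["fair", "excellent", "good", "poor"],
    categories=["poor", "fair", "good", "very good", "excellent"],
    ordered=True,
)

print(performance.min(), performance.max())   # poor excellent
print((performance >= "good").tolist())       # element-wise ordered comparison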
1.5 Collecting Engineering Data

Most data are observations on a population. In Engineering, we always work with a sample selected from a population. There are three basic methods of collecting engineering data, which then define the type of research being undertaken:
(i) Retrospective study (historical data)
(ii) Observational study
(iii) Designed experiments

1.5.1 Retrospective study
This approach uses all, or a sample, of the historical data archived over some period. These can include river discharges, precipitation data, traffic count data, water quality of streams, etc. This approach can have some challenges, as outlined below:
o For any two variables being investigated, a relationship may not exist over the given period of record.
o Changes may be occurring at smaller time steps, yet the data were recorded at longer intervals of time.
o There may be recording errors that cannot be explained.
Some of the above challenges can be mitigated by making your own observations or by carrying out experiments.

1.5.2 Observational studies
The Engineer observes the processes in the population. This can include, for example, measuring the water quality of a river at smaller time intervals, conducting detailed traffic counts, testing various concrete trial mixes, and so on. Observational studies are usually conducted over a relatively short period; nonetheless, they allow other variables of interest that are not usually recorded to be included. It is usually not possible to determine cause-effect relationships in such studies because there are many other confounding factors that may influence the outcome.

1.5.3 Sampling techniques
Retrospective and observational studies are usually carried out on a sample collected from the population in a random framework. If data are not collected in a random framework, the sample is said to be biased and the statistical analyses performed would not be reliable. The most common sampling techniques are:
o Simple random sampling - any particular sample of a specified sample size has the same chance of being selected as any other sample of the same size.
o Stratified sampling - the population is divided into groups called strata, in which similar cases are grouped together; simple random sampling is then employed within each stratum.
Other techniques available are cluster sampling and multi-stage sampling. A brief sketch of the two common techniques is given below.
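As a brief illustration (not part of the original notes), the sketch below draws a simple random sample and a stratified sample from a small hypothetical table of concrete cube tests using pandas; the column names and strata labels are assumptions made for the example.

import pandas as pd

# Hypothetical records of concrete cube tests, grouped by mix class (the strata).
records = pd.DataFrame({
    "mix_class":    ["C20", "C20", "C20", "C25", "C25", "C25", "C30", "C30", "C30"],
    "strength_mpa": [21.5, 22.3, 20.9, 26.1, 25.4, 27.0, 31.2, 30.5, 32.0],
})

# Simple random sampling: every possible subset of size 3 is equally likely.
simple = records.sample(n=3, random_state=0)

# Stratified sampling: draw one cube at random from within each mix class.
stratified = records.groupby("mix_class", group_keys=False).sample(n=1, random_state=0)

print(simple)
print(stratified)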
1.5.4 Designed experiments
In this type of research study, the Engineer deliberately changes the variables (factors) in an experiment, observes the system output, and then makes an inference or decision about which variables are responsible for the observed change. These decisions can readily be made using ANOVA statistical techniques. Experiments can establish cause-effect relationships; that is to say, they can establish causal relationships between two or more variables. In observational studies, you can only determine associations between variables.

Example: consider a problem involving the choice of an optimum concrete mix design to achieve a certain compressive strength. A concrete mix design usually has three material inputs (or factors) that determine its strength, i.e. cement, sand and aggregates. The Engineer can choose to perform a series of tests for each mix design to obtain the strengths of the concrete. He/she would then be interested in knowing whether there is any difference between the different mix designs (or treatments). To answer this research question, the approach to use would be to compare the mean strengths of the different mix designs => this is a problem of hypothesis testing (inferential statistics).

The following terms are commonly used in the design of experiments:
Factors: the example above has three factors, and we would want to investigate how the three factors affect the strength of concrete.
Factor levels: the specified values for each factor are called factor levels. Typically, we use a small number of levels for each factor, e.g. two or three, such as a low and a high value for each factor.
Treatments: each combination of factor levels represents a treatment => the number of treatments gives the number of experiments. A reasonable experimental design strategy would use every possible combination of the factor levels to form a basic experiment with eight different settings => this is called a factorial experiment. Considering only two factor levels, high (+) and low (−), the number of experiments or treatments required for the above example is 2^n, where n is the number of factors, i.e. 2^3 = 8 experiments.

The designed experiment (factorial design) for the concrete mix design:

Cement   Sand   Aggregates
  −1      −1       −1
  +1      −1       −1
  −1      +1       −1
  +1      +1       −1
  −1      −1       +1
  +1      −1       +1
  −1      +1       +1
  +1      +1       +1

Factorial experiments allow one to detect interactions between factors. A response surface methodology can be used to investigate the effects of these factors on the output. As n increases, the number of tests increases, making the experiment unfeasible in terms of time and resources. A fractional factorial experiment can then be performed, in which only a subset (e.g. half) of the factor combinations is actually tested while all factors are still considered.
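As a small programming illustration (an addition to these notes), the 2^n treatment combinations of a full factorial design can be enumerated directly; the sketch below generates the 2^3 = 8 runs for the three concrete mix factors.

from itertools import product

factors = ["Cement", "Sand", "Aggregates"]

# Full 2^3 factorial: every combination of the low (-1) and high (+1) levels.
design = list(product([-1, +1], repeat=len(factors)))

print("  ".join(factors))
for run in design:
    print("  ".join("%+d" % level for level in run))
print("number of treatments =", len(design))  # 2**3 = 8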