Modeling Random Phenomena: Input Data Distribution PDF

THE INPUT DATA DISTRIBUTION
MODELING RANDOM PHENOMENA

DECIDING ON THE SIMULATION INPUT DATA

• Examples of input may be customer interarrival times, priority level, service times.
• How do we determine what to use as input data to the simulation model? We have several possibilities:
1) Constant – no randomness
2) Make an assumption about the input distribution (and its parameters), based on theory or past research
3) Use historical data as is
4) Use data to fit a distribution. Then we can use Monte Carlo sampling to sample from a well-known theoretical distribution rather than use historical data as input.
Why is this better? The theoretical distributions have well-known characteristics; additionally, we can extrapolate to values in the distribution other than those sampled.

MODELING RANDOMNESS - THE INPUT DISTRIBUTION

DECIDING ON THE SIMULATION INPUT DATA

Q: What are the main advantages of using a theoretical probability distribution rather than historical data as the input to a simulation model?
• Theoretical distributions have well-known characteristics, such as the mean, variance, skewness, etc. This provides more information about the underlying random process.
• Theoretical distributions allow for extrapolation to values not present in the historical data sample, enabling the simulation to explore a wider range of potential scenarios.
• Theoretical distributions can be more parsimonious, requiring fewer parameters to be estimated compared to using the historical data directly.

IDENTIFYING THE DATA DISTRIBUTION

Could the data values I observed have come from a certain specified probability distribution? Steps:
1) Collect data
2) Summarize in a frequency distribution (histogram) to identify the shape of the distribution
3) Identify underlying theoretical probability distribution, or a family of distributions
4) Obtain parameter(s) for the distribution chosen, probably estimated from the data.
5) Test for fit

THE HISTOGRAM

• The data is generally collected in intervals of equal size (if the data supports that) and then graphed as a vertical bar chart, or histogram.
• In this figure from the Banks, Carson textbook, the first attempt to organize the data results in (a) a “ragged” histogram that is not very useful. Then, combining adjacent cells first results in a chart that is (b) too coarse and then finally one that is (c) “just right” – er, appropriate.

DECIDING ON THE SIMULATION INPUT DATA

Describe the key considerations in selecting the class intervals when creating a histogram for continuous data.
• Number of intervals: Too few intervals will result in a histogram that is too coarse and may miss important features of the data distribution. Too many intervals can lead to a "ragged" histogram that is difficult to interpret.
• Interval width: The width of each class interval should be consistent and chosen to provide an appropriate level of detail. Smaller interval widths can better capture the shape of the distribution but may result in some intervals having very low frequencies.
• Interval endpoints: The choice of where to set the interval endpoints can affect the appearance of the histogram, especially near the tails of the distribution.

DECIDING ON THE SIMULATION INPUT DATA

Why is it important to create a histogram of the data before identifying the underlying probability distribution?
• The histogram provides a visual representation of the shape of the data distribution, which can give clues as to the appropriate theoretical distribution (e.g., normal, exponential, Poisson, etc.).
• The histogram can help identify potential outliers or unusual features of the data that may need to be addressed before fitting a theoretical distribution.
• The histogram can be used to estimate initial parameter values for the theoretical distribution, which can aid in the fitting process.
• The histogram serves as a way to verify the appropriateness of the final fitted distribution by comparing the histogram to the probability density function of the chosen distribution.

HOW TO HISTOGRAM

• The procedure will differ slightly depending on whether your data is discrete or continuous.
• Some examples:

Discrete Data               Continuous Data
Number of defects           Weekly production
Number of jobs in queue     Time

EXAMPLE: WEEKLY PRODUCTION (CONTINUOUS)

• Here we have set up a frequency distribution with class intervals.

Weekly production (X)    Frequency    Relative Frequency P(X)
Below 46                     1            0.008
46 – 55                      1            0.008
56 – 65                      3            0.025
66 – 75                      7            0.058
76 – 85                     11            0.092
86 – 95                     21            0.175
96 – 105                    28            0.234
106 – 115                   16            0.134
116 – 125                   22            0.183
126 – 135                    7            0.058
136 – 145                    1            0.008
146 and up                   2            0.017
Total                      120            1.000

EXAMPLE: TIME TO COMPLETE A TASK (CONTINUOUS)

• Note that time data is always considered continuous.

X (minutes)              Frequency    Relative Frequency
10 but less than 20          6            0.06
20 but less than 30         25            0.25
30 but less than 40         32            0.32
40 but less than 50         23            0.23
50 but less than 60          7            0.07
60 but less than 70          5            0.05
70 but less than 80          2            0.02

Mean = 37.3 minutes
Median = ?
Mode = ?
What can we do with these statistics?

EXAMPLE: NUMBER OF TELEPHONE INQUIRIES PER 1-HOUR INTERVAL (DISCRETE)

# inquiries (N)    # 1-hr intervals with N    Relative Frequency
0                        315                      0.619
1                        142                      0.279
2                         40                      0.078
3                          9                      0.018
4                          2                      0.004
5                          1                      0.002
Total                    509                      1.000

Frequency is the number of times an event occurs. Relative frequency is that count divided by the total number of observations (here, the 509 one-hour intervals).
[We will return to this example later in this lecture.]
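As a rough illustration of how the summary columns in these tables are obtained, the sketch below recomputes the relative frequencies and the mean for the task-time table. It assumes the mean was estimated from the grouped data by treating every observation as the midpoint of its class; the function and variable names are illustrative, not from the lecture.

```python
# Class intervals [lower, upper) in minutes and their observed frequencies,
# copied from the task-time table.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [6, 25, 32, 23, 7, 5, 2]


def relative_frequencies(freqs):
    """Return each frequency divided by the total number of observations."""
    total = sum(freqs)
    return [f / total for f in freqs]


def grouped_mean(classes, freqs):
    """Approximate the mean by placing every observation at its class midpoint."""
    total = sum(freqs)
    weighted = sum(((lo + hi) / 2) * f for (lo, hi), f in zip(classes, freqs))
    return weighted / total


rel_freq = relative_frequencies(frequencies)
mean_minutes = grouped_mean(intervals, frequencies)

for (lo, hi), f, r in zip(intervals, frequencies, rel_freq):
    print(f"{lo} but less than {hi}: {f:3d}  {r:.2f}")
print(f"grouped mean = {mean_minutes:.1f} minutes")
```

Under the midpoint assumption the grouped mean agrees with the 37.3 minutes quoted on the slide. The same two helpers apply to the weekly production table once the open-ended classes ("Below 46", "146 and up") are given assumed bounds.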
EXAMPLE: NUMBER OF DEFECTS (DISCRETE)

X        Frequency    P(X)
1            10       0.03
2            10       0.03
3            20       0.06
4            20       0.06
5            40       0.11
6            50       0.14
7            70       0.20
8            60       0.17
9            50       0.14
10           20       0.06
Total       350       1.00

• What is the
  • mean = ?
  • median = ?
  • mode = ?
• Is this distribution normal? symmetric? (need to graph to be more specific)

IDENTIFYING THE PROBABILITY DISTRIBUTION

How do we “assign” the input probability distribution based on the data we observed?
• Theory
• Shape of the curve
• Estimated parameters

SOME PROBABILITY DISTRIBUTIONS

• Uniform. All outcomes equally likely. Models complete uncertainty. Sometimes used as a first approximation.
• Normal. May model the distribution of a process that may be thought of as the sum (or average) of a number of random processes, e.g., time to assemble a complex product that is the sum of the assembly times of the component parts of the product. Theoretical normal distribution ranges from -∞ to +∞.
• Lognormal. May model the distribution of a process that may be thought of as the product of component processes, e.g., rate of return on an investment when compounding interest. Source: Wikipedia (2018, March 9). Log-normal distribution.
• Binomial. Models the number of hits in n independent trials, when trials all have the same probability of a hit, p. For example, number of defective items in a lot of size n.
• Negative Binomial. Models the number of trials to achieve k hits, e.g., number of items that must be inspected to find k defectives.
• Poisson. Models the number of independent events that occur in a continuous interval, e.g., a fixed amount of time or space.
• Exponential. May be used to model times, e.g., times between successive events. Related to the Poisson in an “inverse” fashion: if the number of events per interval is Poisson, the times between successive events are exponential.
• Gamma. Used to model nonnegative random variables. Very flexible. Constant serves to shift distribution from 0.
• Weibull. Used to model time to failure for component parts of a system.
• Beta. Used to model random variables with fixed upper and lower limits.
• Erlang. Models processes that may be viewed as the sum of several exponentially distributed processes, e.g., time to failure of a system based on exponentially distributed times to failure of component parts.
• Triangular. Models a process for which only the minimum, most likely, and maximum values of the distribution are known, e.g., the min, max, and most likely times required for product testing. Not perfect, but at least this is an improvement over a uniform distribution.
• Empirical. Uses actual data to construct the distribution from which to sample.

TESTING FOR FIT

• We wish to test (statistically) the hypothesis that a set of observed data does not differ significantly from that which would be expected from a specified theoretical distribution. (Why?)
• The χ² distribution allows us to test different hypotheses about frequencies (or proportions). We can use the χ² test statistic to measure the discrepancy that exists between an observed and an expected frequency:

TESTING FOR FIT USING χ²

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where
f_o ≡ observed frequency for each class or interval
f_e ≡ expected frequency for each class or interval as predicted by the theoretical distribution
Σ ≡ the sum over the k classes or intervals

Note that since the χ² statistic is based on sums of squares it can never be negative.
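A minimal sketch of the statistic just defined, written as a plain Python function. The function name and the four-class counts are hypothetical and serve only to show the arithmetic; they are not data from the lecture.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all classes or intervals."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Hypothetical counts for four classes (illustration only).
observed = [18, 22, 27, 33]
expected = [25, 25, 25, 25]

statistic = chi_square_statistic(observed, expected)
print(f"chi-square = {statistic:.2f}")

# Every term is a square divided by a positive count, so the result is never negative.
print(chi_square_statistic(expected, expected))
```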
The minimum value is 0, which you would get if every observed frequency is exactly equal to its corresponding expected frequency.

[Note: In order for the test to be valid, the expected frequencies must be at least 5 in each category.]

TESTING FOR FIT USING χ²

• As with the Student’s t distribution, the χ² distribution is a series of distributions, i.e., a family of curves. There is a different χ² distribution for each value of degrees of freedom.
• The degrees of freedom is one less than the number of classes (categories) used to test fit.
• We note that while the χ² distribution is a continuous distribution, the calculated value of the test statistic is based on discrete counts. This continuous approximation of a discrete distribution works when the expected frequencies are large enough. The usual rule of thumb is that the expected frequencies should each be at least 5. Cells that do not meet this criterion should be combined with values in adjacent cells (and the degrees of freedom adjusted accordingly).

EXAMPLE: DISTRIBUTION OF A DIE

Suppose we wish to test the hypothesis that a particular die follows a uniform distribution, i.e., that it is fair. The die is tossed 60 times with the results below.

H0: There is no difference between the empirical and theoretical distributions.
H1: There is a difference, i.e., lack of fit.

Alternatively,

H0: the random variable follows the uniform distribution
H1: the random variable does not follow the uniform distribution

Test at α = .05, that is, we allow for a 5% probability that we will reject H0 when it is true (the α error).

X        fo     fe     (fo – fe)    (fo – fe)²    (fo – fe)²/fe
1         8     10        -2            4             0.4
2        12     10         2            4             0.4
3        10     10         0            0             0
4        11     10         1            1             0.1
5        12     10         2            4             0.4
6         7     10        -3            9             0.9
Total    60     60         0                      χ² = 2.2

• Conclusion: Do not reject H0. The computed χ² = 2.2 is well below the critical value of 11.07 at α = .05 and 5 degrees of freedom.
• Note the d.f. = 6 classes - 1 = 5. We lose 1 degree of freedom because we are forcing the total of the expected frequencies to equal the total of the observed frequencies.

EXAMPLE: NUMBER OF TELEPHONE INQUIRIES PER ONE HOUR INTERVAL

# inquiries (N)    # 1-hr intervals with N    Relative Frequency
0                        315                      0.619
1                        142                      0.279
2                         40                      0.078
3                          9                      0.018
4                          2                      0.004
5                          1                      0.002
Total                    509                      1.000

How do we determine the input distribution?

• What to do? The histogram kind of looks like an exponential distribution. We could try testing

H0: the random variable follows the exponential distribution
H1: the random variable does not follow the exponential distribution

• To get the expected frequencies for an exponential random variable, we would use

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x > 0$$

• But this data is closer to the definition of Poisson distributed data: it counts events per interval rather than measuring times. And the shape of the histogram is not so far off from a Poisson. Let’s try that:

$$P(x \text{ events in an interval}) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

• We calculate a mean rate of λ = 262 inquiries / 509 intervals = .5147 inquiries per hour. We use this to calculate the expected relative frequencies.

H0: the random variable follows the Poisson distribution
H1: the random variable does not follow the Poisson distribution

With 5 degrees of freedom, and α = .05, the critical value of the statistic is 11.07 (same as example above).

# inquiries    Freq. (observed)    Rel. Freq. (observed)    Poisson Rel. Freq. (expected)    Freq. (expected)    (fo – fe)²/fe
0                   315                 0.6189                     0.5977                        304.23              0.38
1                   142                 0.2790                     0.3076                        156.57              1.36
2                    40                 0.0786                     0.0792                         40.31              0.00
3                     9                 0.0177                     0.0136                          6.92              0.62
4                     2                 0.0039                     0.0017                          0.87              1.49
5                     1                 0.0020                     0.0002                          0.10              7.92
TOTAL               509                 1.0000                     1.0000                        509            χ² = 11.78

Since 11.78 > 11.07, we must reject H0. It’s close, though. What to do now?

Recall that for this test statistic to be considered valid, the expected frequencies should each be at least 5. The expected frequencies for 4 and 5 inquiries (0.87 and 0.10) violate this, so we combine the last 3 categories (3, 4, 5 inquiries per hour = 9 + 2 + 1 = 12 observed):

# inquiries    Freq. (observed)    Rel. Freq. (observed)    Poisson Rel. Freq. (expected)    Freq. (expected)    (fo – fe)²/fe
0                   315                 0.6189                     0.5977                        304.23              0.38
1                   142                 0.2790                     0.3076                        156.57              1.36
2                    40                 0.0786                     0.0792                         40.31              0.00
≥ 3                  12                 0.0236                     0.0155                          7.90              2.13
TOTAL               509                 1.00                       1.00                          509            χ² = 3.88

At α = .05 and 3 degrees of freedom, the critical value from the χ² distribution is 7.815.
Conclusion: Do not reject H0.
(Because λ was estimated from the data, the usual adjustment subtracts one more degree of freedom for the estimated parameter, giving 4 - 1 - 1 = 2 and a critical value of 5.991. The conclusion is unchanged.)

• Sidebar: Why did we think Poisson would be a good fit? The data as described should fit the theoretical description of a Poisson process. The Poisson distribution applies to a discrete process where events occur randomly at a constant rate. These discrete events occur randomly within a continuous interval (e.g., time, space), with a constant average rate of occurrence (λ).
• Also note: An alternative to the chi-square test, the Kolmogorov-Smirnov Goodness-of-Fit Test, is useful when sample size is small and parameters are not estimated from the data. It is a nonparametric (distribution-free) test.
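The Poisson fit for the telephone-inquiry data can be reproduced with a short script. This is a sketch under stated assumptions: the variable names are illustrative, the "3 or more" class is assigned all remaining probability mass, and the critical value 7.815 is simply copied from the lecture rather than computed. Because the script does not round the intermediate probabilities, its statistic should land close to, but not exactly on, the 3.88 quoted in the slides.

```python
import math

# Observed number of one-hour intervals with 0, 1, 2, 3, 4, 5 inquiries.
observed = [315, 142, 40, 9, 2, 1]
n_intervals = sum(observed)

# Estimate the Poisson rate: total inquiries divided by total intervals.
rate = sum(x * f for x, f in enumerate(observed)) / n_intervals


def poisson_pmf(x, lam):
    """P(x events in an interval) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Combine the sparse cells: classes 0, 1, 2 and a single "3 or more" class.
obs_combined = observed[:3] + [sum(observed[3:])]
probs = [poisson_pmf(x, rate) for x in range(3)]
probs.append(1.0 - sum(probs))  # remaining mass goes to the "3 or more" class
exp_combined = [p * n_intervals for p in probs]

chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(obs_combined, exp_combined))

CRITICAL_VALUE = 7.815  # alpha = .05, 3 degrees of freedom, as quoted in the lecture

print(f"estimated rate = {rate:.4f} inquiries per hour")
print("expected frequencies:", [round(fe, 2) for fe in exp_combined])
print(f"chi-square = {chi_square:.2f}")
print("reject H0" if chi_square > CRITICAL_VALUE else "do not reject H0")
```

Swapping `CRITICAL_VALUE` for 5.991 applies the stricter two-degrees-of-freedom rule for an estimated λ; the decision stays the same for this data.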
