Statistics DISPENSA PDF
Summary
This document provides a general overview of statistical concepts, including descriptive and inferential statistics, as well as different types of variables and data collection methods. Topics covered include random sampling and systematic sampling.
Full Transcript
STATISTICS
The practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample. It is the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data and that, through the use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements.
→ Statistics is a tool to help process, summarize, analyze, and interpret data.
- It is about data collection, regarding numbers and data.

Two branches of statistics:
- DESCRIPTIVE: graphical and numerical techniques to summarize and display the information contained in a dataset.
- INFERENTIAL: used when not all the data about something are available, and the available subset is used to say something about the whole. Sample data are used to make decisions or predictions about a "larger" population of data.

Records and Variables:
- N represents the population size, i.e. the total number of observations in the data set for the whole population (for instance, all the people interviewed). For a given "variable" of interest (marital status, annual return) and a set of N "units" (individuals, stocks), a population is the collection of the N values of the variable (one value for each unit). N is therefore also the maximum possible sample size.
- A sample is a subset of the population.
- n represents the sample size: little n refers to a sample/subset.
- A parameter is a specific characteristic of a population.
- A statistic is a specific characteristic of a sample, calculated from the n values of the sample.

Random Sampling:
Simple random sampling is a procedure by which
- each member of the population is chosen strictly by chance;
- each member of the population is equally likely to be chosen.
The resulting sample is called a random sample. Because every member has the same chance of being selected, the selection follows a uniform distribution.
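The simple random sampling procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the population of 72 numbered units, the sample size of 9, and the function name `simple_random_sample` are hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit is equally likely to be chosen."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: units numbered 1..N with N = 72
population = list(range(1, 73))
sample = simple_random_sample(population, n=9, seed=0)
print(sorted(sample))
```

Sampling without replacement guarantees that no unit appears twice, which matches the idea of a sample as a subset of the population.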
In Excel, a random position can be drawn with =RANDBETWEEN(1;n) for a sample or =RANDBETWEEN(1;N) for a population.

Systematic Sampling:
First, it is necessary to make sure that the population is arranged in a way that is not related to the subject of interest. Then select every j-th item from the population, where j is the ratio of the population size to the sample size, j = N/n. The first item is chosen by randomly selecting a number from 1 to j.
→ Example: for a sample of n = 9 people from a population of N = 72, j = 72/9 = 8, so there are 8 possible systematic samples, one for each starting point from 1 to 8.

Descriptive statistics
→ present data (e.g., tables)
→ summarize data (e.g., mean of the observations)
Inferential statistics
→ estimation (e.g., estimate the population mean weight using the sample mean weight)
→ hypothesis testing (e.g., test the claim that the population mean weight is 140 pounds)
Inference is the process of drawing conclusions or making decisions about a population based on sample results.

WHAT IS THE STARTING POINT OF EACH STATISTICAL ANALYSIS?
1. Identify whether you are working with a population or a sample: the calculation might be the same, but it is still essential to know what we are working with;
2. Check the total number of observations available (N/n), e.g. with functions that report how many data points are present in the columns;
3. Analyze the variables (columns): even if the final stage is more or less the same, it is better to always check first the type of variables we are dealing with.

CLASSIFYING VARIABLES:
- Categorical: responses fall into groups or categories, e.g. "Are you a salesman? Yes/No"
1. Nominal variables (there is no order, so the labels can be placed in whatever order is preferred). Nominal data are considered the lowest or weakest type of data, since numerical identification is chosen strictly for convenience and does not imply any ranking of responses;
2. Ordinal variables (the values can be ordered, indicating the rank ordering of items, with words describing responses. There is no measurable meaning to the "difference" between responses). Ordinal values can be sorted in ascending or descending order;
- Numerical: quantitative data that can be either
1. Discrete (counted items, similar to categorical ordinal: the values can be counted and they are not so many);
2. Continuous/ratio data/interval data (everything related to measurements. The value is usually a certain value within a given range of real numbers and it usually arises from a measurement process).
The next step is understanding the frequencies (the counts of the corresponding values) of each variable.

Graphical presentation of data - from raw data to tables.
Usual types of charts:
- Categorical variables
1. Bar chart and pie
- Numerical discrete
1. Bar chart and pie
- Numerical continuous
1. Histogram and pie
2. Ogive

Describing CATEGORICAL VARIABLES can be done using frequency distribution tables and graphs, such as bar charts, pie charts, and the Pareto diagram (not covered in the course).
- TABULATING DATA: Frequency distribution
It is a table used to organize data. The left column commonly shows all possible responses on the variable being studied. The right column is a list of frequencies, or numbers of observations, for each class. A relative frequency distribution is obtained by dividing each frequency by the number of observations and multiplying the resulting proportion by 100%.
- GRAPHING DATA: Tables and charts
For frequency distribution tables of a categorical variable, the classes are simply the possible responses to the categorical variable. Bar charts and pie charts are commonly used for categorical variables.

Categorical/discrete
- It is essential to extract the distinct labels/values of the variable;
- Tally: count the number of observations for each distinct value.
Counting can be done through functions that count the values in each column.
- Function: =COUNTIF(range; criteria);
Copying formulas down: with $F$6 the reference stays fixed on the same cell wherever it is copied; with F$6 the row stays fixed while the column is free to move;
- Relative frequency: = absolute frequency / total number of observations

Cumulative absolute frequency: how many observations of the variable of interest fall at or below the specific value (that value included).
- First cumulative absolute = first absolute; first cumulative relative = first relative
- Last cumulative absolute = N or n; last cumulative relative = 1 or 100%.

BAR CHARTS AND PIE CHARTS → are used for qualitative (categorical) data
→ Bar chart
→ Pie chart

PARETO DIAGRAM
Used to portray categorical data
→ A bar chart, where categories are shown in descending order of frequency
→ A cumulative polygon is often shown in the same graph
→ Used to separate the "vital few" from the "trivial many"

Describing NUMERICAL VARIABLES can be done using frequency distributions and cumulative distributions (by histogram and ogive) and the stem-and-leaf display.
- Discrete: e.g. size of family (1,2,3,…); number of credit cards owned (0,1,2,3,…); number of heart beats per minute (1,2,3,…) [usually from 45 to 90] → bar chart and pie
- Continuous: e.g. income, height, inflation rate, bond yields → histogram and pie/ogive

FREQUENCY DISTRIBUTIONS
A frequency distribution is a list or a table containing
- Discrete variable: distinct values and corresponding frequencies
- Continuous variable: intervals or class groupings (within which the data fall) and the corresponding frequencies
A frequency distribution is a way to summarize data → the distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data.

How variables are classified
- Univariate (one variable);
- Bivariate (two variables);
1. Contingency table (two categorical);
2. Cross table (at least one quantitative);
- Multivariate (more than two)

One variable → =COUNTIF; two variables → =COUNTIFS. In a two-variable table, the frequencies inside the table are called joint frequencies. Those placed in the last row/column are called marginal frequencies, and they are obtained by summing all the frequencies in that column/row. Finally, there are conditional distributions, which are the distributions of one variable given that the second one assumes a specific (predefined) value. If r is the number of rows and c is the number of columns, there are r conditional distributions for the variable on the columns and c conditional distributions for the variable on the rows.

Recap:
1. Identify variables;
2. Define a frequency distribution (computing relative and cumulative frequencies);
3. Chart data (qualitative, quantitative);
4. Data presentation errors.

INTERVALS: =FREQUENCY(data array; intervals), where each interval is given by its upper limit (e.g., the maximum age of the interval).

DESCRIBING DATA NUMERICALLY
Central tendency:
- Arithmetic mean;
- Median;
- Mode.
Numerical measures (the mean, median, and mode) are related to the location of the center of the data. These numerical measures provide information about a "typical" observation in the data and are referred to as measures of central tendency.
Variation:
- Range;
- Interquartile range;
- Variance;
- Standard deviation;
- Coefficient of variation.

Two possible ways to compute central tendency/variation:
1. Directly from raw data, since Excel provides all the formulas needed to evaluate the indicators (e.g. AVERAGE, VAR, STDEV): pick the data and apply the formula in Excel.
2. A slightly more complicated way, through the frequency distribution. Here the original data are not available, so the Excel formulas cannot be applied directly, and it is essential to remember the original statistical formulas for the variance, the average, and so on.
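The second route above (working from a frequency distribution instead of raw data) can be sketched as follows. This is a minimal illustration, not the course's own material: the family-size values and frequencies are made up, the function names are hypothetical, and the variance uses the sample denominator n - 1 on the assumption that the data are a sample.

```python
def mean_from_frequencies(values, freqs):
    """Mean of a discrete variable given its distinct values and frequencies."""
    n = sum(freqs)                                   # total number of observations
    return sum(v * f for v, f in zip(values, freqs)) / n

def sample_variance_from_frequencies(values, freqs):
    """Sample variance (denominator n - 1) from a frequency distribution."""
    n = sum(freqs)
    m = mean_from_frequencies(values, freqs)
    return sum(f * (v - m) ** 2 for v, f in zip(values, freqs)) / (n - 1)

# Hypothetical frequency distribution: family size and number of families
values = [1, 2, 3, 4]
freqs = [2, 3, 4, 1]

mean = mean_from_frequencies(values, freqs)
variance = sample_variance_from_frequencies(values, freqs)
print(mean, variance)
```

Each distinct value is weighted by how often it occurs, so the result matches what the raw-data formulas would give on the expanded list of observations.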
Three typical ways to measure central tendency:

1. The arithmetic mean of a set of data is the sum of the data values divided by the number of observations. The arithmetic mean is the most common measure of central tendency.
If the data set is the entire population of data, then the population mean, µ, is a parameter given by
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
where N = population size and Σ means "the sum of".
If the data set is from a sample, then the sample mean, $\bar{x}$, is a statistic given by
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where n = sample size.
The mean is appropriate for numerical data. The calculation is the same for a population and for a sample; only the notation differs. Unlike the median, the mean is sensitive to extreme values: when the extremes get larger, the mean is pulled toward them.
Formula to use in Excel: =AVERAGE(range)

2. The median is the middle observation of a set of observations that are arranged in increasing (or decreasing) order. If the sample size, n, is an odd number, the median is the middle observation. If the sample size, n, is an even number, the median is the average of the two middle observations. The median is the number located in the 0.50(n+1)th ordered position.
In an ordered list, the median is the "middle" number (50% above, 50% below). The median is a better indicator of the midpoint when there are extreme values: since it is the center point, it depends only on how many observations are at or below it, not on how large the extremes are.
How to find the median?
- Find the position of the median in the ordered data:
Median position = (n+1)/2 position in the ordered data.
E.g., with n = 12 the median position is 6.5: any value between the observations in positions 6 and 7 satisfies the definition, and by convention their average is used.
=MEDIAN(range)

3. The mode is a measure of central tendency.
The mode, if one exists, is the most frequently occurring value. A distribution with one mode is called unimodal; with two modes, it is called bimodal; and with more than two modes, the distribution is said to be multimodal. The mode is most commonly used with categorical data.
=MODE(range)

The decision as to whether the mean, median, or mode is the appropriate measure to describe the central tendency of data is context specific. One factor that influences our choice is the type of data, categorical or numerical. Categorical data are best described by the median or the mode, not the mean, which is better suited to numerical data.

MEASURES OF LOCATION
Percentiles and Quartiles
Percentiles and quartiles are measures that indicate the location or position of a certain value relative to a whole set of data. To find percentiles and quartiles, data must first be arranged in order from the smallest to the largest values.
The Pth percentile is a value such that approximately P% of the observations are at or below that number. Percentiles separate large, ordered data sets into 100ths.
Pth percentile = value located in the (P/100)(n+1)th ordered position.
Quartiles are descriptive measures that separate large data sets into four quarters.
The first quartile, Q1, or 25th percentile, separates approximately the smallest 25% of the data from the remainder of the data.
The second quartile, Q2, or 50th percentile, is the median.
The third quartile, Q3, or 75th percentile, separates approximately the smallest 75% of the data from the remaining largest 25%.

Five-Number Summary
The five-number summary refers to the five descriptive measures: minimum, first quartile, median, third quartile, maximum.
The five-number summary is an exploratory data analysis tool that provides insight into the distribution of values for one variable. Collectively, this set of statistics describes where data values occur, their central tendency, variability, and the general shape of their distribution.
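The percentile position rule (P/100)(n+1) and the five-number summary described above can be sketched in Python. This is an illustration under assumptions: linear interpolation is used when the position falls between two observations (other tools may use a different convention), positions outside 1..n are clamped to the extremes, and the data and function names are hypothetical.

```python
def percentile(data, p):
    """Value at the (p/100)*(n+1)-th ordered position, interpolating linearly."""
    ordered = sorted(data)
    n = len(ordered)
    pos = (p / 100) * (n + 1)        # 1-based position in the ordered data
    pos = min(max(pos, 1), n)        # assumption: clamp to the first/last observation
    lower = int(pos)                 # position of the observation just below
    frac = pos - lower               # how far to move toward the next observation
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

# Hypothetical data set with n = 4 observations
summary = five_number_summary([40, 10, 30, 20])
print(summary)
```

With n = 4 the median position is 0.50(4+1) = 2.5, so the median is halfway between the second and third ordered observations.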
The purpose of the five-number summary is to provide a preliminary sense of your data during the exploratory phase of analysis. Statisticians picked these five statistics because they are less sensitive to skewed distributions and outliers: the statistics in the five-number summary are more robust than the mean and standard deviation.

Weighted Mean
In some situations it is necessary to compute a special type of mean, called the weighted mean, which gives more or less importance to different values in the same dataset. One everyday example of a weighted mean is the calculation of the GPA (media ponderata).
$\bar{x}_w = \frac{\sum_{i} w_i x_i}{\sum_{i} w_i}$
Here w_i is the weight of the ith observation (written f_i when the weights are frequencies, in which case the sum of all weights is n). In words, the weighted mean is the sum of the products between each value and its weight, divided by the sum of all the weights.

MEASURES OF VARIABILITY
Measures of variation give information on the spread or variability of the data values.
In business, variation is seen in sales, advertising costs, the percentage of product complaints, the number of new customers, and so forth.
While two data sets could have the same mean, the individual observations in one set could vary more from the mean than do the observations in the second set. Consider the following two sets of sample data:
Although the mean is 10 for both samples, clearly the data in sample A are farther from 10 than are the data in sample B. We need descriptive numbers to measure this spread.

- Range
It is the simplest measure of variation, and it measures the difference between the largest and the smallest observations.
The greater the spread of the data from the center of the distribution, the larger the range will be. Since the range takes into account only the largest and smallest observations, it is considerably distorted if there is an unusual extreme observation.
Although the range measures the total spread of the data, the range may be an unsatisfactory measure of variability (spread) because outliers, either very high or very low observations, influence it. One way to avoid this difficulty is to arrange the data in ascending or descending order, discard a few of the highest and a few of the lowest numbers, and find the range of those remaining. Sometimes the lowest 25% of the data and the highest 25% of the data are removed. To do this, we define quartiles and the interquartile range, which measures the spread of the middle 50% of the data.

- Interquartile range
The interquartile range (IQR) measures the spread in the middle 50% of the data; it is the difference between the observation at Q3, the third quartile (or 75th percentile), and the observation at Q1, the first quartile (or 25th percentile).

- Box-and-Whisker Plots
It is a graph that describes the shape of a distribution in terms of a five-number summary: the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. The inner box shows the numbers that span the range from the first to the third quartile. A line is drawn through the box at the median. There are two "whiskers." One whisker is the line from the 25th percentile to the minimum value; the other whisker is the line from the 75th percentile to the maximum value. The plot can be oriented horizontally or vertically.

- Variance
Although the range and the interquartile range measure the spread of data, both measures take into account only two of the data values. What we actually need is a measure that averages the total (Σ) distance between each of the data values and the mean. But for all data sets, the sum of the plain differences always equals zero, because the mean is the center of the data: if a data value is less than the mean, the difference between the data value and the mean is negative, which makes no sense as a distance, since a distance cannot be negative.
If each of these differences is squared, then each observation (both above and below the mean) contributes to the sum of the squared terms. The average of the sum of squared terms is called the variance.
The population variance, σ², is the sum of the squared differences between each observation and the population mean divided by the population size, N:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The sample variance, s², is the sum of the squared differences between each observation and the sample mean divided by the sample size, n, minus 1:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Why is the sample variance divided by n - 1? Since our goal is to find an average of squared deviations about the mean, one would expect division by n. However, if we were to take a very large number of samples, each of size n, from the population and compute the sample variance with the equation above for each of these samples, then the average of all these sample variances would be the population variance, σ². In other words, the sample variance is an "unbiased estimator" of the population variance σ². Statisticians have shown that if the population variance is unknown, a sample variance is a better estimator of the population variance when the denominator is (n - 1) rather than n.

- Standard Deviation
Computing the variance requires squaring the distances, which changes the unit of measurement to square units. The standard deviation, which is the square root of the variance, restores the data to their original measurement unit. If the original measurements were in feet, the variance would be in feet squared, but the standard deviation would be in feet. The standard deviation measures the average spread around the mean.
The sample standard deviation (s) and the population standard deviation (σ) differ only in their denominator (n - 1 versus N).

- Coefficient of Variation
E.g. Let's consider an investment opportunity.
We have two stocks, and the mean closing prices of these stocks over the last several months are not equal. We need to compare the coefficients of variation of the two stocks rather than their standard deviations.
The coefficient of variation expresses the standard deviation as a percentage of the mean:
$CV = \frac{s}{\bar{x}} \times 100\%$
It is useful because it shows the variation relative to the mean, and it can be used to compare two or more sets of data measured in different units.

Starting point: RAW DATA → FREQ. DISTRIBUTIONS → SUMMARY STATISTICS
Is it possible to go back to the raw data and the frequency distribution? Yes (raw data), if you have all the percentiles, i.e. the cdf (cumulative distribution function).
The problem arises mainly when only the mean and the standard deviation are available:
1. generally speaking, it is always possible to map the frequency distribution approximately → CHEBYSHEV'S THEOREM
2. if the original raw data belong to a specific distribution (normal or Gaussian) → EMPIRICAL RULE

CHEBYSHEV'S THEOREM
Given the summary statistics, is it possible to go back to the original frequency distribution, the raw data?
The Russian mathematician Chebyshev established data intervals for any data set, regardless of the shape of the distribution.
For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval [μ ± kσ] is at least 100[1 - (1/k²)]%.
Regardless of how the data are distributed, at least (1 - 1/k²) of the values will fall within k standard deviations of the mean (for k > 1):
- k = 1.5 → at least (1 - 1/1.5²) = 55.6% within (μ ± 1.5σ)
- k = 2 → at least (1 - 1/2²) = 75% within (μ ± 2σ)
- k = 3 → at least (1 - 1/3²) = 88.89% within (μ ± 3σ)

How to solve it, given an interval:
- find the HALF-WIDTH of the interval → one extreme - mean
- find k → half-width / standard deviation
- find the percentage of observations that fall within the interval → [1 - (1/k²)] * 100%
To find the extremes of the interval, given the mean, the standard deviation, and the percentage of observations within the interval:
- find k → percentage = 1 - (1/k²) → solve for k, i.e. k = 1/√(1 - p) with the percentage p written as a proportion
- having k, compute the extremes of the interval
MAX: mean + standard deviation * k
MIN: mean - standard deviation * k

NORMAL (GAUSSIAN) DISTRIBUTION OF DATA - EMPIRICAL RULE
If the data distribution is normal (or Gaussian, bell shaped) then:
- if k = 1 → the interval contains about 68% of the values
- if k = 2 → the interval contains about 95% of the values
- if k = 3 → the interval contains almost all the values: 99.7%
The empirical rule specifies the proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean.

GROUPED DATA
Suppose that the n sample values are presented to us grouped into K classes with frequencies f1, f2, ..., fK, and let the midpoints of the classes be m1, m2, ..., mK.
MEAN
$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$
VARIANCE
$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$

MEASURES OF RELATIONSHIPS BETWEEN VARIABLES
COVARIANCE: the measure of the variability (variation) that two variables have in common. It measures the direction of the relationship between the two variables when this relationship is linear.
In Excel: =COVARIANCE.P(variable1; variable2) for a population, =COVARIANCE.S(variable1; variable2) for a sample.
It is the average of the products of the deviations of the two variables from their averages → the difference between population and sample is that for a population the denominator is N, while for a sample it is n - 1.
COV(variable1, variable1) = VAR(variable1) → when only one variable is analyzed, COV = VAR.
IT CAN BE:
- POSITIVE: the two variables move in the same direction
- NEGATIVE: the two variables move in opposite directions
- ZERO: there is no linear relation

CORRELATION COEFFICIENT: related to the covariance (similar to the coefficient of variation), it measures the relative strength of the linear relationship between two variables.
= COV / [STDEV(variable1) * STDEV(variable2)]
- It lies in the interval [-1; 1].
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker any linear relationship

PROBABILITY
RANDOM EXPERIMENT = a process leading to an uncertain outcome
BASIC OUTCOME = a possible outcome of a random experiment
SAMPLE SPACE (S) = the collection of all possible outcomes of a random experiment
EVENT (E) = any subset of basic outcomes from the sample space
INTERSECTION OF EVENTS = if A and B are two events in a sample space S, then the intersection, A ∩ B, is the set of all outcomes in S that belong to both A and B
A and B are MUTUALLY EXCLUSIVE EVENTS if they have no basic outcomes in common, i.e. the set A ∩ B is empty → no outcome belongs to both events
UNION OF EVENTS = if A and B are two events in a sample space S, then the union, A ∪ B, is the set of all outcomes in S that belong to either A or B
Events E1, E2, …, Ek are COLLECTIVELY EXHAUSTIVE events if E1 ∪ E2 ∪ … ∪ Ek = S, i.e., the events completely cover the sample space.
The COMPLEMENT OF AN EVENT A is the set of all basic outcomes in the sample space that do not belong to A → the complement is denoted Ā (A bar).
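The event operations defined above map directly onto set operations. The following is a minimal sketch with Python sets, assuming a single roll of a die as the random experiment; the variable names are illustrative only.

```python
# Sample space for one roll of a die
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}            # event: the number rolled is even
B = {4, 5, 6}            # event: the number rolled is at least 4

intersection = A & B     # outcomes that belong to both A and B
union = A | B            # outcomes that belong to A or B (or both)
complement_A = S - A     # outcomes in S that do not belong to A

mutually_exclusive = len(intersection) == 0   # True only if A and B share no outcome
collectively_exhaustive = union == S          # True only if A and B cover all of S

print(intersection, union, complement_A)
print(mutually_exclusive, collectively_exhaustive)
```

An event and its complement are always both mutually exclusive and collectively exhaustive, which can be checked with `A & complement_A` and `A | complement_A`.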
EXAMPLE
Let the SAMPLE SPACE be the collection of all possible outcomes of rolling one die: S = {1, 2, 3, 4, 5, 6}
Let A be the event "number rolled is even."
Let B be the event "number rolled is at least 4."
Then A = {2, 4, 6} and B = {4, 5, 6}
MUTUALLY EXCLUSIVE: A and B are not mutually exclusive → the outcomes 4 and 6 are common to both
COLLECTIVELY EXHAUSTIVE: A and B are not collectively exhaustive → A ∪ B does not contain 1 or 3

PROBABILITY is the chance that an uncertain event will occur (always between 0 and 1)
ASSESSING PROBABILITY:
1. CLASSICAL PROBABILITY
2. RELATIVE FREQUENCY PROBABILITY
3. SUBJECTIVE PROBABILITY

CLASSICAL PROBABILITY
It assumes that all outcomes in the sample space are equally likely to occur.
Classical probability of an event A:
$P(A) = \frac{N_A}{N}$ (number of outcomes that satisfy the event, divided by the total number of outcomes in the sample space)
The number of possible ways of arranging x objects in order is
$x! = x(x-1)(x-2)\cdots(2)(1)$
where x! is read as "x factorial" (how many orderings can be set up starting from a set of objects).
PERMUTATIONS: the number of possible arrangements when x objects are to be selected from a total of n objects and arranged in order (with n - x objects left over)
$P^n_x = \frac{n!}{(n-x)!}$
In Excel → =PERMUT(number; number chosen)
COMBINATIONS: the number of combinations of x objects chosen from n, when order does not matter. It can be written as
$C^n_x = \frac{n!}{x!(n-x)!}$
In Excel → =COMBIN(number; number chosen)

EXAMPLE
Suppose that two letters are to be selected from A, B, C, D and arranged in order
→ how many PERMUTATIONS are possible?
The number of permutations, with n = 4 and x = 2, is
$P^4_2 = \frac{4!}{(4-2)!} = 12$
The permutations are: AB - AC - AD - BA - BC - BD - CA - CB - CD - DA - DB - DC
→ how many COMBINATIONS are possible? (order is not important)
$C^4_2 = \frac{4!}{2!(4-2)!} = 6$
The combinations are: AB (same as BA) - AC (same as CA) - AD (same as DA) - BC (same as CB) - BD (same as DB) - CD (same as DC)

RELATIVE FREQUENCY PROBABILITY
The limit of the proportion of times that an event A occurs in a large number of trials, n:
$P(A) = \frac{n_A}{n}$
SUBJECTIVE PROBABILITY
An individual opinion or belief about the probability of occurrence.

PROBABILITY POSTULATES
1. If A is any event in the sample space S, then: $0 \le P(A) \le 1$
2. If two events A and B are disjoint, then: $P(A \cup B) = P(A) + P(B)$
3. The probability of the sample space S is equal to 1: $P(S) = 1$

PROBABILITY RULES
- The COMPLEMENT RULE: $P(\bar{A}) = 1 - P(A)$
- The ADDITION RULE → the probability of the union of two events is: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- The CONDITIONAL PROBABILITY RULE → the probability of one event "given that another event has occurred": $P(A|B) = \frac{P(A \cap B)}{P(B)}$, provided P(B) > 0
- The MULTIPLICATION RULE: $P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$
- The INDEPENDENCE RULE → TWO EVENTS are independent if and only if $P(A \cap B) = P(A)P(B)$, or equivalently $P(A|B) = P(A)$ when P(B) > 0. MULTIPLE EVENTS are independent if and only if $P(E_1 \cap E_2 \cap \dots \cap E_k) = P(E_1)P(E_2)\cdots P(E_k)$
- BAYES' THEOREM → a formula used for calculating conditional probabilities: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
A drilling company has estimated a 40% chance of striking oil for their new well. It decides to run a geological test. Historically, of all successful wells, 60% have had a geological test suggesting the presence of oil, while, of all unsuccessful wells, 20% have had a geological test suggesting the presence of oil. Given that a geological test suggests the presence of oil for the new well, what is the probability that the new well will be successful?
→ Let S = successful well and U = unsuccessful well
o P(S) = .4, P(U) = .6 (prior probabilities)
o D: geological test suggests presence of oil
o Conditional probabilities: P(D|S) = .6, P(D|U) = .2
→ The goal is to find P(S|D) >> apply Bayes' theorem: $P(S|D) = \frac{P(D|S)P(S)}{P(D|S)P(S) + P(D|U)P(U)}$
- JOINT PROBABILITIES RULE
- MARGINAL PROBABILITIES RULE

RANDOM VARIABLES
A random variable represents a possible numerical value from a random experiment.
A DISCRETE RANDOM VARIABLE takes on no more than a countable number of values.
A CONTINUOUS RANDOM VARIABLE takes on any value in an interval → possible values are measured on a continuum.
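The drilling example given for Bayes' theorem above can be worked numerically. The sketch below uses only the probabilities stated in that example; the function name `bayes_posterior` and its parameter names are hypothetical, and the function assumes exactly two complementary hypotheses (successful or unsuccessful well).

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(H | D) for a hypothesis H and its complement, via Bayes' theorem."""
    # Total probability of the evidence D across both hypotheses
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# P(S) = 0.4, P(D|S) = 0.6, P(D|U) = 0.2, with P(U) = 1 - P(S) = 0.6
p_success_given_test = bayes_posterior(prior=0.4, likelihood=0.6,
                                       likelihood_given_not=0.2)
print(round(p_success_given_test, 4))
```

The numerator is 0.6 × 0.4 = 0.24 and the denominator is 0.24 + 0.2 × 0.6 = 0.36, so the positive test raises the probability of a successful well from the prior 0.4 to about 0.667.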
PROBABILITY DISTRIBUTIONS FOR DISCRETE RANDOM VARIABLES
Let X be a discrete random variable and x be one of its possible values.
The probability that X takes the value x is denoted P(X = x).
The probability distribution (function) of a random variable X is P(x) = P(X = x), for all x.
It can be represented algebraically, graphically, or with a table.
REQUIRED PROPERTIES:
- $0 \le P(x) \le 1$ for any value of x
- the individual probabilities sum to 1: $\sum_x P(x) = 1$

CUMULATIVE PROBABILITY FUNCTION
The cumulative probability function, denoted F(x0), shows the probability that X does not exceed the value x0:
$F(x_0) = P(X \le x_0)$
DERIVED RELATIONSHIP
The derived relationship between the probability distribution and the cumulative probability distribution: let X be a random variable with probability distribution P(x) and cumulative probability distribution F(x0). Then
$F(x_0) = \sum_{x \le x_0} P(x)$
(the notation implies that summation is over all possible values of x that are less than or equal to x0)
DERIVED PROPERTIES
Derived properties of cumulative probability distributions for discrete random variables: let X be a discrete random variable with cumulative probability distribution F(x0). Then
1. 0 ≤ F(x0) ≤ 1 for every number x0
2. if x0 < x1, then F(x0) ≤ F(x1)

MEAN → expected value (or mean) of a discrete random variable X:
$E[X] = \mu = \sum_x x P(x)$
VARIANCE → the variance of a discrete random variable X:
$\sigma^2 = E[(X - \mu)^2] = \sum_x (x - \mu)^2 P(x)$
STANDARD DEVIATION → the standard deviation of a discrete random variable:
$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_x (x - \mu)^2 P(x)}$

FUNCTIONS OF RANDOM VARIABLES
If P(x) is the probability function of a discrete random variable X, and g(X) is some function of X, then the expected value of the function g is
$E[g(X)] = \sum_x g(x) P(x)$
We can re-interpret the formula for the variance as:
$\sigma^2 = E[X^2] - \mu^2 = \sum_x x^2 P(x) - \mu^2$

DISCRETE PROBABILITY DISTRIBUTIONS can be:
1. Binomial
2. Poisson
3. Hypergeometric

BERNOULLI DISTRIBUTION
Consider a trial with only two possible outcomes: "success" or "failure".
Let P denote the probability of success.
Let 1 - P be the probability of failure.
Define the random variable X: X = 1 if success, X = 0 if failure.
Then the BERNOULLI PROBABILITY DISTRIBUTION is
$P(0) = 1 - P$ and $P(1) = P$
The MEAN is: $\mu = E[X] = P$
The VARIANCE is: $\sigma^2 = P(1 - P)$

BINOMIAL DISTRIBUTION
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. It describes a set of trials, each with two possible outcomes (not a single trial as in the Bernoulli distribution).
The first variable in the binomial formula, n, stands for the number of times the experiment runs. The second variable, P, represents the probability of one specific outcome.
Binomial distributions must also meet the following three criteria:
1. The number of observations or trials is fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times. This is common sense: if you toss a coin once, your probability of getting a tails is 50%. If you toss a coin 20 times, your probability of getting at least one tails is very, very close to 100%.
2. Each observation or trial is independent. In other words, none of your trials has an effect on the probability of the next trial.
3. The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.
Once you know that your distribution is binomial, you can apply the binomial distribution formula to calculate the probability of x successes in n trials:
$P(x) = \frac{n!}{x!(n-x)!} P^x (1 - P)^{n-x}$
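The binomial probability of x successes in n independent trials can be sketched in Python as follows. This is a minimal illustration assuming the standard binomial formula; the function name `binomial_pmf` is hypothetical, and the coin-toss numbers follow the example in the criteria above.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    # comb(n, x) counts the ways to place x successes among n trials
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Coin tossed 20 times, "success" = tails with probability 0.5
p_no_tails = binomial_pmf(0, 20, 0.5)     # no tails at all in 20 tosses
p_at_least_one_tails = 1 - p_no_tails     # complement rule
print(p_no_tails, p_at_least_one_tails)
```

The complement rule is what makes the "very close to 100%" statement precise: it refers to at least one tails in 20 tosses, not to any single toss.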
In EXCEL → =BINOM.DIST(x; n; P; FALSE)

POISSON DISTRIBUTION
The Poisson distribution is used to determine the probability of the number of occurrences of a certain event in a given
→ time interval
→ surface area
Examples:
- number of telephone calls arriving at the switchboard of a large company in one minute
- number of customers arriving at a service desk in a fixed interval
Assume an interval is divided into a very large number of equal subintervals, where the probability of the occurrence of an event in any subinterval is very small.
POISSON DISTRIBUTION ASSUMPTIONS
1. The probability of the occurrence of an event is constant for all subintervals.
2. The probability of two or more occurrences in each subinterval is negligible in comparison with the probability of one occurrence.
3. Occurrences are independent; that is, an occurrence in one interval does not influence the probability of an occurrence in another interval.
POISSON DISTRIBUTION CHARACTERISTICS (with λ the expected number of occurrences in the interval)
- MEAN: $\mu = E[X] = \lambda$
- VARIANCE: $\sigma^2 = \lambda$
- STANDARD DEVIATION: $\sigma = \sqrt{\lambda}$

PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES
A continuous random variable is a variable that can assume any value in an interval
→ thickness of an item
→ time required to complete a task
→ temperature of a solution
→ height, in inches
These can potentially take on any value, depending only on the ability to measure accurately.

CUMULATIVE DISTRIBUTION FUNCTION
The cumulative distribution function, F(x), for a continuous random variable X expresses the probability that X does not exceed the value of x:
$F(x) = P(X \le x)$
Let a and b be two possible values of X, with a < b. The probability that X lies between a and b is
$P(a < X < b) = F(b) - F(a)$

PROBABILITY DENSITY FUNCTION
The probability density function, f(x), of a random variable X has the following properties:
- f(x) ≥ 0 for all values of x.
→ The set of values x for which f(x) > 0 is called the support of the density f(x) or of the r.v. X.
- The area under the probability density function f(x) over all values of the random variable X within its range is equal to 1.
- The probability that X lies between two values a and b is the area under the density function graph between the two values (in the figure, the shaded area under the curve is the probability that X is between a and b).
- The cumulative distribution function F(x0) is the integral of the probability density function f(.) up to x0:
$F(x_0) = \int_{-\infty}^{x_0} f(x)\,dx$

PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES can be:
1. Uniform distribution
2. Normal distribution
3. Exponential distribution

UNIFORM DISTRIBUTION
The uniform distribution is a probability distribution that has equal probabilities for all equal-width intervals within the range of the random variable. On the interval from a to b its density is
$f(x) = \frac{1}{b - a}$ for $a \le x \le b$, and 0 otherwise
The expected value (mean) of X is $\mu = E[X]$.
The variance of X is defined as the expectation of the squared deviation, (X - μ)², of a random variable from its mean: $\sigma^2 = E[(X - \mu)^2]$.
MEAN AND VARIANCE OF THE UNIFORM DISTRIBUTION
MEAN → $\mu = \frac{a + b}{2}$
VARIANCE → $\sigma^2 = \frac{(b - a)^2}{12}$

NORMAL DISTRIBUTION
The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution model is important in statistics and is key to the Central Limit Theorem (CLT).
CHARACTERISTICS:
- Bell shaped
- Symmetrical
- Mean, mode and median are equal
- The support is from -∞ to +∞
- The integral over the support is 1 (the area under the curve is 1)
The density formula is → $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
The cumulative distribution function is → $F(x_0) = P(X \le x_0)$
For a normal random variable X with μ as mean and σ² as variance, we have X ~ N(μ, σ²).
The probability of a range of values is measured by the area under the curve.

FINDING NORMAL PROBABILITIES
1. F(b) = P(X … or equal to 13) → use the COMPLEMENT
Suppose X has a normal distribution N(10; 9), so μ = 10 and σ = 3
=1-NORM.DIST(x; mean; stdev; TRUE)
3. P(a