Super Thinking (PDF) - Conditional Probability & Bayes' Theorem

Document Details

Uploaded by scrollinondubs

Stanford School of Medicine

Tags

conditional probability, Bayes' theorem, statistics, probability

Summary

This document discusses conditional probability, including examples of how additional information can change probability estimates. It explains how the probability of one event happening given another can be different from the probability of the second event given the first. The concept of Bayes' Theorem is also introduced. The document also explores different approaches to probability, including the frequentist and Bayesian schools, highlighting how sample size influences conclusions. It also cautions against the base rate fallacy.

Full Transcript

The second row shows a plot of this sample distribution for an approval rating based on polling two randomly selected people. This plot looks different from the original distribution, but still nothing like a normal distribution, as it can have only three outcomes: two approvals (the spike at 1), two disapprovals (the spike at 0), or one approval and one disapproval (the spike at 0.5). If you base the polls on asking five people, the sample distribution starts to look a bit more like a bell shape with six possible outcomes (third row). With thirty people (thirty-one outcomes, depicted in the fourth row), it starts to look a lot like the characteristic bell-curve shape of the normal distribution. As you ask more and more people, the sample distribution becomes more and more like a normal distribution, with a mean of 25 percent, the true approval rating from the population distribution.

Just as in the case of body temperatures or heights, while this mean is the most likely value obtained by the poll, values close to it are also likely, such as 24 percent. Values further and further away are less and less likely, with probabilities following the normal distribution. How much less likely, exactly? It depends on how many people you ask. The more people you ask, the tighter the distribution. To convey this information, polls like this usually report a margin of error. An article describing the poll results might include something like "Congress has a 24 percent approval rating with a margin of error of ±3 percent." The "±3 percent" is the margin of error, but where this margin of error comes from or what it really means is rarely explained. With knowledge of the above mental models, you now can know!

The margin of error is really a type of confidence interval, an estimated range of numbers that you think may include the true value of the parameter you are studying, e.g., the approval rating. This range has a corresponding confidence level, which quantifies the level of confidence you have that the true value of the parameter is in the range you estimated. For example, a confidence level of 95 percent tells you that if you ran the poll many times and calculated many confidence intervals (one for each poll), on average 95 percent of them would include the true approval rating (i.e., 25 percent). Most media reports don't mention the confidence level used to calculate their margin of error, but it is usually safe to assume they used 95 percent. Research publications, by contrast, are usually more explicit in stating what confidence levels they used to represent the uncertainty in their estimates (again typically, though not always, 95 percent).

For the approval-rating scenario, the range is calculated using the fact that the central limit theorem tells us that the sample mean is approximately normally distributed, so we should expect 95 percent of possible values to be found within two standard deviations of the true mean (i.e., the real approval rating). The part that hasn't been explained yet is that the standard deviation of this distribution, also called the standard error, is not the same as the sample standard deviation calculation from earlier. However, these two values are directly related. In particular, the standard error is the same as the standard deviation for the sample, divided by the square root of the sample size. This means that if you want to reduce the margin of error by a factor of two, you need to increase the sample size by a factor of four.
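To make this relationship concrete, here is a minimal sketch (not from the book; the function name is illustrative) assuming the usual z ≈ 1.96 for 95 percent confidence and the worst-case standard deviation for a yes/no question:

```python
import math

def margin_of_error(sample_sd: float, n: int, z: float = 1.96) -> float:
    """Margin of error ~ z * standard error, where the standard error is the
    sample standard deviation divided by the square root of the sample size."""
    return z * sample_sd / math.sqrt(n)

# Worst-case standard deviation for a yes/no question: sqrt(0.5 * 0.5) = 0.5.
sd = 0.5
print(round(margin_of_error(sd, 96), 3))   # ~0.10 -> about +/-10 percent
print(round(margin_of_error(sd, 384), 3))  # ~0.05 -> 4x the sample, half the margin
```

Quadrupling the sample from 96 to 384 people halves the margin, matching the square-root relationship described above.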
For a yes/no poll like the approval rating, a margin of error of 10 percent is achieved with just 96 people, 5 percent at 384 people, 3 percent at 1,067 people, and 2 percent at 2,401. Since the margin of error is an expression of how confident the pollsters are in their estimate, it makes sense that it is directly related to the size of the sample group.

The illustration on the next page shows how confidence intervals work for repeated experiments. It depicts one hundred 95 percent confidence intervals for the probability of flipping heads. Each was calculated from an experiment that involved simulating flipping a fair coin one hundred times. These confidence intervals are represented graphically in the figure by error bars, which are a visual way to display a measure of uncertainty for an estimate.

[Figure: 95% Confidence Intervals from 100 Fair Coin Flips, Experiment Repeated 100 Times]

Error bars are not always confidence intervals; they could be derived from other types of error calculations too. On an error bar, the dot in the middle is the parameter estimate, in this case the sample mean, and the lines at the end indicate the top and bottom of the range, in this case the confidence interval. The error bars in the plot vary due to what was seen in the different experiments, but they each span a range of about twenty percentage points, which corresponds to the ±10 percent mentioned above (with a sample size of one hundred flips). Given the 95 percent confidence level, you would expect ninety-five of these confidence intervals to include the true mean of 50 percent. In this case, ninety-three of the intervals included 50 percent. (The seven intervals that didn't are highlighted in black.)

Confidence intervals like these are often used as estimates of reasonable values for a parameter, such as the probability of getting heads. However, as you just saw, the true value of the parameter (in this case 50 percent) is sometimes outside a given confidence interval. The lesson here is that a confidence interval is not the definitive range for all possible values, and the true value is not guaranteed to be included in the interval.

One thing that really bothers us is when statistics are reported in the media without error bars or confidence intervals. Always remember to look for them when reading reports and to include them in your own work. Without an error estimate, you have no idea how confident to be in that number—is the true value likely really close to it, or could it be really far away from it? The confidence interval tells you that!

IT DEPENDS

As you saw in the last section, the average woman's height is five feet four inches. If you had to guess the height of a random stranger, but you didn't know for a fact that they were a woman, five feet four inches wouldn't be a great guess because the average man is closer to five feet nine inches, and so something in the middle would be better. But if you had the additional information that the person was a woman, then five feet four inches would be the best guess. The additional information changes the probability. This is an example of a model called conditional probability, the probability of one thing happening under the condition that something else also happened. Conditional probability allows us to better estimate probabilities by using this additional information. Conditional probabilities are common in everyday life.
For example, home insurance rates are tailored to the differing conditional probabilities of insurance claims (e.g., premiums are higher in coastal Florida, where hurricane damage is more likely, relative to where we live in Pennsylvania). Similarly, genetic testing can tell you if you are at higher risk for certain diseases; women with abnormal BRCA1 or BRCA2 genes have up to an 80 percent risk of developing breast cancer by age ninety.

Conditional probability is denoted with a | symbol. For example, the probability (P) that you will get breast cancer by age ninety given that you are a woman with a BRCA mutation would be denoted as P(breast cancer by ninety | woman with BRCA mutation). Some people find conditional probabilities confusing. They mix up the probability that an event A will happen given a condition that event B happened—P(A|B)—with the probability that an event B will happen given the condition that event A happened—P(B|A). This is known as the inverse fallacy, whereby people think that P(A|B) and P(B|A) must have similar probabilities. While you just saw that P(breast cancer by ninety | woman with BRCA mutation) is about 80 percent, by contrast P(woman with BRCA mutation | breast cancer by ninety) is only 5 to 10 percent, because many other people develop breast cancer who do not have these mutations.

Let's walk through a longer example to see this fallacy in action. Suppose the police pull someone over at random at a drunk-driving checkpoint and administer a Breathalyzer test that indicates they are drunk. Further, suppose the test is wrong on average 5 percent of the time, saying that a sober person is drunk. What is the probability that this person is wrongly accused of drunk driving?

Your first inclination might be to say 5 percent. However, you have been given the probability that the test says someone is drunk given they are sober, or P(Test=drunk | Person=sober) = 5 percent. But what you have been asked for is the probability that the person is sober given that the test says they are drunk, or P(Person=sober | Test=drunk). These are not the same probabilities! What you haven't considered is how the results depend on the base rate of the percentage of drunk drivers. Consider the scenario where everyone makes the right decision, and no one ever drives drunk. In this case the probability that a person is sober is 100 percent, regardless of what the Breathalyzer test results say. When a probability calculation fails to account for the base rate (like the base rate of drunk drivers), the mistake that is made is called the base rate fallacy.

Let's consider a more realistic base rate, where one in a thousand drivers is drunk, meaning that there is a small chance (0.1 percent) that a person the police randomly pull over is drunk. And since we know one in twenty tests will be wrong (the tests will be wrong 5 percent of the time), the police will most likely go through a lot of wrong tests before they find a person who was actually drunk-driving. In fact, if the police stop a thousand people, they would on average conduct nearly fifty wrong tests along their way to finding one actual drunk driver. So there is approximately only a 2 percent chance that a failed Breathalyzer in this scenario indicates that the person is actually drunk. Alternatively, this can be stated as a 98 percent chance that the person is sober. That's way, way more than 5 percent! So, P(A|B) does not equal P(B|A), but how are they related?
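Before naming that relationship, the 2 percent figure itself can be checked by simple counting. This is an illustrative sketch assuming a 1-in-1,000 base rate, a 5 percent false positive rate, and (implicitly) a test that always flags drivers who really are drunk:

```python
drivers = 1000
drunk = drivers * 0.001          # 1 actual drunk driver
sober = drivers - drunk          # 999 sober drivers

false_positives = sober * 0.05   # ~50 sober drivers wrongly flagged as drunk
true_positives = drunk * 1.0     # assume the test always catches real drunk drivers

# P(sober | test says drunk): of all "drunk" test results, what share are wrong?
p_sober_given_positive = false_positives / (false_positives + true_positives)
print(round(p_sober_given_positive, 3))  # ~0.98 -> only ~2% chance the person is drunk
```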
There is a very useful result in probability called Bayes' theorem, which tells us the relationship between these two conditional probabilities: P(A|B) = P(B|A) × P(A) / P(B). On the next page, you will see how Bayes' theorem relates these probabilities and how, in the drunk-driving example, Bayes' theorem could be applied to calculate the 2 percent result.

[Figures: Bayes' Theorem; Base Rate Fallacy]

Now that you know about Bayes' theorem, you should also know that there are two schools of thought in statistics, based on different ways to think about probability: Frequentist and Bayesian. Most studies you hear about in the news are based on frequentist statistics, which relies on and requires many observations of an event before it can make reliable statistical determinations. Frequentists view probability as fundamentally tied to the frequency of events. By observing the frequency of results over a large sample (e.g., asking a large number of people if they approve of Congress), frequentists estimate an unknown quantity. If there are very few data points, however, they can't say much of anything, since the confidence intervals they can calculate will be extremely large. In their view, probability without observations makes no sense.

Bayesians, by contrast, allow probabilistic judgments about any situation, regardless of whether any observations have yet occurred. To do this, Bayesians begin by bringing related evidence to statistical determinations. For example, picking a penny up off the street, you'd probably initially estimate a fifty-fifty chance that it would come up heads if you flipped it, even if you'd never observed a flip of that particular coin before. In Bayesian statistics, you can bring such knowledge of base rates to a problem. In frequentist statistics, you cannot.

Many people find this Bayesian way of looking at probability more intuitive because it is similar to how your beliefs naturally evolve. In everyday life, you aren't starting from scratch every time, as you would in frequentist statistics. For instance, on policy issues, your starting point is what you currently know on that topic—what Bayesians call a prior—and then when you get new data, you (hopefully) update your prior based on the new information. The same is true for relationships, with your starting point being your previous experiences with that person; a strong prior would be a lifelong relationship, whereas a weak prior would be just a first impression.

You saw in the last section that frequentist statistics produce confidence intervals. These statistics tell you that if you ran an experiment many times (e.g., the one-hundred-coin-flips example we presented), the confidence intervals calculated should contain the parameter you are studying (e.g., 50 percent probability of getting heads) to the level of confidence specified (e.g., 95 percent of the time). To many people's dismay, a confidence interval does not say there is a 95 percent chance of the true value of the parameter being in the interval. By contrast, Bayesian statistics analogously produces credible intervals, which do say that; credible intervals specify the current best estimated range for the probability of the parameter. As such, this Bayesian way of doing things is again more intuitive. In practice, though, both approaches yield very similar conclusions, and as more data becomes available, they should converge on the same conclusion. That's because they are both trying to estimate the same underlying truth.
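For a concrete feel of how a prior gets updated, here is a minimal sketch of Bayesian updating with invented numbers; it assumes a Beta prior on the probability of heads and uses SciPy's beta distribution to get the credible interval:

```python
from scipy import stats

# Prior belief about a coin picked up off the street: strongly centered on
# fifty-fifty (a "strong prior", encoded here as 50 heads and 50 tails).
prior_heads, prior_tails = 50, 50

# New data: 100 observed flips, 60 of them heads. With a Beta prior,
# updating is just adding observed counts to the prior counts (conjugacy).
heads, tails = 60, 40
posterior = stats.beta(prior_heads + heads, prior_tails + tails)

# A 95% credible interval: the current best estimated range for the
# probability of heads, given both the prior and the data.
print(posterior.interval(0.95))  # roughly (0.48, 0.62)
```

A weaker prior (say, 5 heads and 5 tails) would let the same data pull the interval much further from fifty-fifty, which is exactly the strong-versus-weak prior trade-off discussed next.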
Historically, the frequentist viewpoint has been more popular, in large part because Bayesian analysis is often computationally challenging. However, modern computing power is quickly reducing this challenge. Bayesians contend that by choosing a strong prior, they can start closer to the truth, allowing them to converge on the final result faster, with fewer observations. As observations are expensive in both time and money, this reduction can be attractive. However, there is a flip side: it is also possible that Bayesians' prior beliefs are actually doing the opposite, starting them further from the truth. This can happen if they have a strong belief that is based on confirmation bias (see Chapter 1) or another cognitive mistake (e.g., an unjustified strong prior). In this case, the Bayesian approach may take longer to converge on the truth because the frequentist view (starting from scratch) is actually closer to the truth at the start.

The takeaway is that two ways of approaching statistics exist, and you should be aware that, done right, both approaches are valid. Some people are hard-core ideologues who pledge allegiance to one philosophy versus the other, whereas pragmatists (like us) use whichever methodology works best for the situation. And more commonly, remember not to confuse a conditional probability with its inverse: P(A|B) is not equal to P(B|A). You now know that these probabilities are related by Bayes' theorem, which takes into account relevant base rates.

RIGHT OR WRONG?

So far you have learned that you shouldn't base your decisions on anecdotes and that small samples cannot reliably tell you what will happen in larger populations. You might be wondering, then: How much data is enough data to be sure of my conclusions? Deciding the sample size, the total number of data points collected, is a balancing act. On one side, the more information you collect, the better your estimates will be, and the more sure you can be of your conclusions. On the other side is the fact that gathering more data takes more time and more money, and potentially puts more participants at risk. So, how do you know what the right sample size is? That's what we cover in this section.

Even with the best experimental design, sometimes you get a fluke result that leads you to draw the wrong conclusions. A higher sample size will give you more confidence that a positive result is not just a fluke and will also give you a greater chance of detecting a positive result. Consider a typical polling situation, such as measuring public support for an upcoming referendum, e.g., marijuana legalization. Suppose that the referendum ultimately fails, but the pollsters had randomly selected as their respondents people who were more in favor of it when compared with the whole population. This situation could result in a false positive: falsely giving a positive result when it really wasn't true (like the wrong Breathalyzer test). Conversely, suppose the referendum ultimately succeeds, but the pollsters had randomly selected people less in favor of it when compared with the whole population. This situation could result in a false negative: falsely giving a negative result when it really was true.

As another example, consider a mammogram, a medical test used in the diagnosis of breast cancer. You might think a test like this has two possible results: positive or negative. But really a mammogram has four possible outcomes, depicted in the following table.
The two possible outcomes you immediately think of are when the test is right, the true positive and the true negative; the other two outcomes occur when the test is wrong, the false positive and the false negative.

Possible Test Outcomes (results of mammogram)

                                      Evidence of cancer   No evidence of cancer
Patient has breast cancer             True positive        False negative
Patient does not have breast cancer   False positive       True negative

These error models occur well beyond statistics, in any system where judgments are made. Your email spam filter is a good example. Recently our spam filters flagged an email with photos of our new niece as spam (false positive). And actual spam messages still occasionally make it through our spam filters (false negatives). Because making each type of error has consequences, systems need to be designed with these consequences in mind. That is, you have to make decisions on the trade-off between the different types of error, recognizing that some errors are inevitable. For instance, the U.S. legal system is supposed to require proof beyond a reasonable doubt for criminal convictions. This is a conscious trade-off favoring false negatives (letting criminals go free) over false positives (wrongly convicting people of crimes).

In statistics, a false positive is also known as a type I error and a false negative is also called a type II error. When designing an experiment, scientists get to decide on the probability of each type of error they are willing to tolerate. The most common false positive rate chosen is 5 percent. (This rate is also denoted by the Greek letter α, alpha, which is equal to 100 minus the confidence level. This is why you typically see people say a confidence level of 95 percent.) That means that, on average, if your hypothesis is false, one in twenty experiments (5 percent) will get a false positive result.

Regardless of the sample size of your experiment, you can always choose the false positive error rate. It doesn't have to be 5 percent; you could choose 1 percent or even 0.1 percent. The catch is that, for a given sample size, when you do set such a low false positive rate, you increase your false negative error rate, possibly failing to detect a real result. This is where the sample size selection comes in. Once you set your false positive rate, you then determine what sample size you need in order to detect a real result with a high enough probability. This value, called the power of the experiment, is typically selected to be an 80 to 90 percent chance of detection, with a corresponding false negative error rate of 10 to 20 percent. (This rate is also denoted by the Greek letter β, beta, which is equal to 100 minus the power.) Researchers say their study is powered at 80 percent.

Statistical Testing

                        Nothing to detect                      Something to detect
Detected effect         False positive rate (%)                Power (%)
                        aka type I error rate, alpha (α),      100 - false negative rate
                        or significance level
                        Typical rate: 5%                       Typical level: 80%-90%
Did not detect effect   Confidence level (%)                   False negative rate (%)
                        100 - false positive rate              aka type II error rate or beta (β)
                        Typical level: 95%                     Typical rate: 10%-20%

Let's consider an example to illustrate how all these models work together. Suppose a company wants to prove that its new sleep meditation app is working. Their background research shows that half of the time, the average person falls asleep within ten minutes. The app developers think that their app can improve this rate, helping more people fall asleep in less than ten minutes.
The developers plan a study in a sleep lab to test their theory. The test group will use their app and the control group will just go to sleep without it. (A real study might have a slightly more complicated design, but this simple design will let us better explain the statistical models.) The statistical setup behind most experiments (including this one) starts with a hypothesis that there is no difference between the groups, called the null hypothesis. If the developers collect sufficient evidence to reject this hypothesis, then they will conclude that their app really does help people fall asleep faster. That is, the app developers plan to observe both groups and then calculate the percentage of people who fall asleep within ten minutes for each group. If they see enough of a difference between the two percentages, they will conclude that the results are not compatible with the null hypothesis, which would mean their app is likely really working.

The developers also need to specify an alternative hypothesis, which describes the smallest meaningful change they think could occur between the two groups, e.g., 15 percent more people will fall asleep within ten minutes. This is the real result they want their study to confirm and have an 80 percent chance to detect (corresponding to a false negative rate of 20 percent). This alternative hypothesis is needed to determine the sample size. The smaller the difference in the alternative hypothesis, the more people will be needed to detect it. With the experimental setup described, a sample size of 268 participants is required.

All of these models come together visually in the figure on the next page. First, look at the bell curves. (Due to the central limit theorem, we can assume that our differences will be approximately normally distributed.) The curve on the left is for the results under the null hypothesis: that there is no real difference between the two groups. That's why this left bell curve is centered on 0 percent. Even so, some of the time they'd measure a higher or lower difference than zero due to random chance, with larger differences being less likely. That is, due to the underlying variability, even if the app has no real effect, they might still measure differences between the two groups because of the variable times it takes for people to fall asleep. The other bell curve (on the right) represents the alternative hypothesis that the app developers hope to be true: that there is a 15 percent increase in the percentage of people who fall asleep within ten minutes using the app as compared with people not using the app. Again, even if this hypothesis were true, due to variability, some of the time they'd still measure less than a 15 percent increase, and some of the time more than a 15 percent increase. That's why the right bell curve is centered on 15 percent.

[Figure: Statistical Significance. Alpha: 5%, Beta: 20%, Sample size: 268]

The dotted line represents the threshold for statistical significance. All values larger than this threshold (to the right) would result in rejection of the null hypothesis because differences this large are very unlikely to have occurred if the null hypothesis were true. In fact, they would occur with less than a 5 percent chance—the false positive rate initially set by the developers.
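The 268 figure can be reproduced with the standard sample-size formula for comparing two proportions. The book doesn't spell out its exact calculation, so this sketch assumes one common convention, a one-sided 5 percent test with pooled variance under the null:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float, alpha: float, power: float) -> int:
    """Sample size per group for a one-sided two-proportion test,
    using pooled variance under the null hypothesis."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # ~1.645 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)        # ~0.842 for 80% power
    p_bar = (p1 + p2) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p2 - p1)) ** 2
    return math.ceil(n)

# Null: 50% fall asleep within ten minutes. Alternative: 65% (a 15-point gain).
n = sample_size_per_group(0.50, 0.65, alpha=0.05, power=0.80)
print(n, 2 * n)  # 134 per group, 268 total
```

Under these assumptions the formula lands on 134 participants per group, 268 total, matching the study described above.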
The final measure commonly used to declare whether a result is statistically significant is called the p-value, which is formally defined as the probability of obtaining a result equal to or more extreme than what was observed, assuming the null hypothesis was true. Essentially, if the p-value is smaller than the selected false positive rate (5 percent), then you would say that the result is statistically significant. P-values are commonly used in study reports to communicate such significance. For example, a p-value of 0.01 would mean that a difference equal to or larger than the one observed would happen only 1 percent of the time if the app had no effect. This value corresponds to a value on the figure in the extreme tail of the left bell curve and close to the middle of the right bell curve. This placement indicates that the result is more consistent with the alternative hypothesis, that the app has an effect of 15 percent.

Now, notice how these two curves overlap, showing that some differences between the two groups are consistent with both hypotheses (under both bell curves simultaneously). These gray areas show where the two types of error can occur. The light gray area is the false positive region and the dark gray area is the false negative region. A false positive would occur when a large difference is measured between the two groups (like one with a p-value of 0.01), but in reality, the app does nothing. This could happen if the no-app group randomly had trouble falling asleep and the app group randomly had an easy time. Alternatively, a false negative would occur when the app really does help people fall asleep faster, but the difference observed is too small to be statistically significant. If the study is 80 percent powered, which is typical, this false negative scenario would occur 20 percent of the time.

Assuming the sample size remains fixed, lowering the chance of making a false positive error is equivalent to moving the dotted line to the right, shrinking the light gray area. When you do so, though, the chance of making a false negative error grows (depicted in the following figure as compared with the original).

[Figure: Statistical Significance. Alpha: 2%, Beta: 33%, Sample size: 268]

If you want to reduce one of the error rates without increasing the other, you need to increase the sample size. When that happens, each of the bell curves becomes narrower (see the figure below, again as compared to the original).

[Figure: Statistical Significance. Alpha: 5%, Beta: 12%, Sample size: 344]

Increasing the sample size and narrowing the bell curves decreases the overlap between the two curves, shrinking the total gray area in the process. This is of course attractive because there is less chance of making an error; however, as we noted in the beginning of this section, there are many reasons why it may not be practical to increase the sample size (time, money, risk to participants, etc.). The table on the next page illustrates how sample size varies for different limits on the error rates for the sleep app study. You will see that if error rates are decreased, the sample size must be increased. The sample size values in the following table are all dependent on the selected alternative hypothesis of a 15 percent difference. The sample sizes would all further increase if the developers wanted to detect a smaller difference and would all decrease if they wanted to detect only a larger difference.
Researchers often feel pressure to use a smaller sample size in order to save time and money, which can make choosing a larger difference for the alternative hypothesis appealing. But such a choice comes at a high risk. For instance, the developers can reduce their sample size to just 62 (from 268) if they change the alternative hypothesis to a 30 percent increase between the two groups (up from 15 percent).

Sample Size Varies with Power and Significance

Alpha   Confidence level   Beta   Power   Sample size
10%     90%                20%    80%     196
10%     90%                10%    90%     284
5%      95%                30%    70%     204
5%      95%                20%    80%     268
5%      95%                10%    90%     370
1%      99%                20%    80%     434
1%      99%                10%    90%     562

However, if the true difference the app makes is really only 15 percent, with this smaller sample size they will be able to detect this smaller difference only 32 percent of the time! That's down from 80 percent originally and means that roughly two-thirds of the time they'd get a false negative, failing to detect the 15 percent difference. As a result, ideally any experiment should be designed to detect the smallest meaningful difference.

One final note on p-values and statistical significance: most statisticians caution against overreliance on p-values in interpreting the results of a study. Failing to find a significant result (a sufficiently small p-value) is not the same as having confidence that there is no effect. The absence of evidence is not the evidence of absence. Similarly, even though the study may have achieved a low p-value, it might not be a replicable result, which we will explore in the final section.

Statistical significance should not be confused with scientific, human, or economic significance. Even the most minuscule effects can be detected as statistically significant if the sample size is large enough. For example, with enough people in the sleep study, you could potentially detect a 1 percent difference between the two groups, but is that meaningful to any customers? No. Alternatively, more emphasis could be placed on the difference measured in a study along with its corresponding confidence interval. For the app study, while the customers want to know that they have better chances of falling asleep with the app than without, they also want to know how much better. The developers might even want to increase the sample size in order to be able to guarantee a certain margin of error in their estimates.

Further, the American Statistical Association stressed in The American Statistician in 2016 that "scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." Focusing too much on the p-value encourages black-and-white thinking and compresses the wealth of information that comes out of a study into just one number. Such a singular focus can make you overlook possible suboptimal choices in a study's design (e.g., sample size) or biases that could have crept in (e.g., selection bias).

WILL IT REPLICATE?

By now you should know that some experimental results are just flukes. In order to be sure a study result isn't a fluke, it needs to be replicated. Interestingly, in some fields, such as psychology, there has been a concerted effort to replicate positive results, but those efforts have found that fewer than 50 percent of positive results can be replicated. That rate is low, and this problem is aptly named the replication crisis. This final section offers some models to explain how this happens, and how you can nevertheless gain more confidence in a research area.
Replication efforts are an attempt to distinguish between false positive and true positive results. Consider the chances of replication in each of these two groups. A false positive is expected to replicate—that is, a second false positive is expected to occur in a repetition of the study—only 5 percent of the time. On the other hand, a true positive is expected to replicate 80 to 90 percent of the time, depending on the power of the replication study. For the sake of argument, let's assume this is 80 percent as we did in the last section. Using those numbers, a replication rate of 50 percent requires about 60 percent of the studies to have been true positives and 40 percent of them to have been false positives. To see this, consider 100 studies: if 60 were true positives, we would expect 48 of those to replicate (80 percent of 60). Of the remaining 40 false positives, 2 would replicate (5 percent of 40), for a total of 50. The replication rate would then be 50 per 100 studies, or 50 percent.

[Figure: Replication Crisis. Re-test 100 Studies]

So, under this scenario, about a fourth of the failed replications (12 of 50) are explained by a lack of power in the replication efforts. These are real results that would likely be replicated successfully either if an additional replication study were done or if the original replication study had a higher sample size. The rest of the results that failed to replicate should have never been positive results in the first place. Many of these original studies probably underestimated their type I error rate, increasing their chances of being a false positive. That's because when a study is designed for a 5 percent chance of a false positive, that chance applies only to one statistical test, but very rarely is only one statistical test conducted.

The act of running additional tests to look for statistically significant results has many names, including data dredging, fishing, and p-hacking (trying to hack your data looking for small enough p-values). Often this is done with the best of intentions, as seeing data from an experiment can be illuminating, spurring a researcher to form new hypotheses. The temptation to test these additional hypotheses is strong, since the data needed to analyze them has already been collected. The trouble comes in, though, when a researcher overstates results that arise from these additional tests.

The XKCD cartoon on this page illustrates how data dredging can play out: when no statistically significant relationship was found between jelly beans and acne, the scientists proceeded to dredge through twenty-one subgroups until one with a sufficiently low p-value was found, resulting in the headline "Green Jelly Beans Linked to Acne!" Each time another statistical test was done, the chance of forming an erroneous conclusion continued to grow above 5 percent. To see this, suppose you had a twenty-sided die. The chances of making a mistake on the first test would be the same as the chances of rolling a one. Each additional test run would be another roll of the die, each with another one-in-twenty chance of rolling a one. After twenty-one rolls (matching the twenty-one jelly bean colors in the comic), there is about a two-thirds chance that a one is rolled at least once, i.e., that there was at least one erroneous result. If this type of data dredging happens routinely enough, then you can see why a large number of studies in the set to be replicated might have been originally false positives.
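The die-rolling arithmetic is easy to verify in a couple of lines, assuming twenty-one independent tests each with a 5 percent false positive rate:

```python
alpha = 0.05
tests = 21  # one test per jelly bean color in the comic

# Chance of at least one false positive across all tests: the complement
# of every single test (correctly) coming back negative.
p_at_least_one = 1 - (1 - alpha) ** tests
print(round(p_at_least_one, 3))  # ~0.66, about a two-thirds chance
```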
In other words, in this set of one hundred studies, the base rate of false positives is likely much larger than 5 percent, and so another large part of the replication crisis can likely be explained as a base rate fallacy.

Unfortunately, studies are much, much more likely to be published if they show statistically significant results, which causes publication bias. Studies that fail to find statistically significant results are still scientifically meaningful, but both researchers and publications have a bias against them for a variety of reasons. For example, there are only so many pages in a publication, and given the choice, publications would rather publish studies with significant findings over ones with none. That's because successful studies are more likely to attract attention from media and other researchers. Additionally, studies showing significant results are more likely to contribute to the careers of the researchers, where publication is often a requirement to advance. Therefore, there is a strong incentive to find significant results from experiments. In the cartoon, even though the original hypothesis didn't show a significant result, the experiment was "salvaged" and eventually published because a secondary hypothesis was found that did show a significant result. The publication of false positives like this directly contributes to the replication crisis and can delay scientific progress by influencing future research toward these false hypotheses. And the fact that negative results aren't always reported can also lead to different people testing the same negative hypotheses over and over again because no one knows other people have tried them.

There are also many other reasons a study might not be replicable, including the various biases we've discussed in previous sections (e.g., selection bias, survivorship bias, etc.), which could have crept into the results. Another reason is that, by chance, the original study might have showcased a seemingly impressive effect, when in reality the effect is much more modest (regression to the mean). If this is the case, then the replication study probably does not have a large enough sample size (isn't sufficiently powered) to detect the small effect, resulting in a failed replication of the study.

There are ways to overcome these issues, such as the following:

- Using lower p-values to properly account for false positive error in the original study, across all the tests that are conducted
- Using a larger sample size in a replication study to be able to detect a smaller effect size
- Specifying statistical tests to run ahead of time to avoid p-hacking

Nevertheless, as a result of the replication crisis and the reasons that underlie it, you should be skeptical of any isolated study, especially when you don't know how the data was gathered and analyzed. More broadly, when you interpret a claim, it is important to evaluate critically any data that backs up that claim: Is it from an isolated study or is there a body of research behind the claim? If so, how were the studies designed? Have all biases been accounted for in the designs and analyses? And so on. Many times, this investigation will require some digging. Media sources can draw false conclusions and rarely provide the necessary details to allow you to understand the full design of an experiment and evaluate its quality, so you will usually need to consult the original scientific publication.
Nearly all journals require a full section describing the statistical design of a study, but given the word constraints of a typical journal article, details are sometimes left out. Look for longer versions or related presentations on research websites. Researchers are also generally willing to answer questions about their research.

In the ideal scenario, you would be able to find a body of research made up of many studies, which would eliminate doubts as to whether a certain result was a chance occurrence. If you are lucky, someone has already published a systematic review about your research question. Systematic reviews are an organized way to evaluate a research question using the whole body of research on a certain topic. They define a detailed and comprehensive (systematic) plan for reviewing study results in an area, including identifying and finding relevant studies in order to remove bias from the process.

Some but not all systematic reviews include meta-analyses, which use statistical techniques to combine data from several studies into one analysis. The data-driven reporting site FiveThirtyEight is a good example; it conducts meta-analyses across polling data to better predict political outcomes. There are advantages to meta-analyses, as combining data from multiple studies can increase the precision and accuracy of estimates, but they also have their drawbacks. For example, it is problematic to combine data across studies where the designs or sample populations vary too much. They also cannot eliminate biases from the original studies themselves. Further, both systematic reviews and meta-analyses can be compromised by publication bias because they can include only results that are publicly available. Whenever we are looking at the validity of a claim, we first look to see whether a thorough systematic review has been conducted, and if so, we start there. After all, systematic reviews and meta-analyses are commonly used by policy makers in decision making, e.g., in developing medical guidelines.

If one thing is clear from this chapter, it's probably that designing good experiments is tough! We hope you've also gathered that probability and statistics are useful tools for better understanding problems that involve uncertainty. However, as this section should also make clear, statistics is not a magical cure for uncertainty. As statistician Andrew Gelman suggested in The American Statistician in 2016, we must "move toward a greater acceptance of uncertainty and embracing of variation."

More generally, keep in mind that while statistics can help you obtain confident predictions across a variety of circumstances, it cannot accurately predict what will occur in an individual event. For instance, you may know that the average summer day is sunny and warm at your favorite beach spot, but that is no guarantee that it won't be rainy or unseasonably cool the week you plan to take off from work. Similarly, medical research tells you that your risk of getting lung cancer increases if you smoke, and while you can estimate the confidence interval that an average smoker will get lung cancer in their lifetime, probability and statistics can't tell you what specifically will happen for any one individual smoker. While probability and statistics aren't magic, they do help you better describe your confidence around the likelihood of various outcomes.
There are certainly lots of pitfalls to watch out for, but we hope you also take away the fact that research and data are more useful for navigating uncertainty than hunches and opinions.

KEY TAKEAWAYS

- Avoid succumbing to the gambler's fallacy or the base rate fallacy.
- Anecdotal evidence and correlations you see in data are good hypothesis generators, but correlation does not imply causation—you still need to rely on well-designed experiments to draw strong conclusions.
- Look for tried-and-true experimental designs, such as randomized controlled experiments or A/B testing, that show statistical significance.
- The normal distribution is particularly useful in experimental analysis due to the central limit theorem. Recall that in a normal distribution, about 68 percent of values fall within one standard deviation, and 95 percent within two.
- Any isolated experiment can result in a false positive or a false negative and can also be biased by myriad factors, most commonly selection bias, response bias, and survivorship bias.
- Replication increases confidence in results, so start by looking for a systematic review and/or meta-analysis when researching an area.
- Always keep in mind that when dealing with uncertainty, the values you see reported or calculate yourself are uncertain themselves, and that you should seek out and report values with error bars!

6 Decisions, Decisions

IF YOU COULD KNOW HOW your decisions would turn out, decision making would be so easy! It is hard because you have to make decisions with imperfect information. Suppose you are thinking of making a career move. You have a variety of next steps to consider:

- You could look for the same job you're doing now, though with some better attributes (compensation, location, mission of organization, etc.).
- You could try to move up the professional ladder at your current job.
- You could move to a similar organization at a higher position.
- You could switch careers altogether, starting by going back to school.

There are certainly more options. When you dig into them all, the array of choices seems endless. And you won't be able to try any of them out completely before you commit to one. Such is life. How do you make sense of it all?

The go-to framework for most people in situations like this is the pro-con list, where you list all the positive things that could happen if the decision was made (the pros), weighing them against the negative things that could happen (the cons). While useful in some simple cases, this basic pro-con methodology has significant shortcomings. First, the list presumes there are only two options, when as you just saw there are usually many more. Second, it presents all pros and cons as if they had equal weight. Third, a pro-con list treats each item independently, whereas these factors are often interrelated. A fourth problem is that since the pros are often more obvious than the cons, this disparity can lead to a grass-is-greener mentality, causing you mentally to accentuate the positives (e.g., greener grass) and overlook the negatives.

As an example, in 2000, Gabriel finished school and began a career as an entrepreneur. Early on, at times, he considered switching to a career in venture capital, where he would fund and support companies instead of starting his own. When he initially made a pro-con list, this seemed like a great idea.
There were many pros (the chance to work with founders changing the world, the potential for extremely high compensation, the opportunity to work on startups in a high-leverage way without the risk and stress of being the founder, etc.) and no obvious cons. However, there were several cons that he just didn't fully appreciate or just didn't know about yet (the relentless socializing involved—not good for a major introvert—the burden of having to constantly say no to people, the difficulty of breaking into the field, the fact that much of your time is spent with struggling companies, etc.). While certainly a great career for some who get the opportunity, venture capital was not a good fit for Gabriel, even if he didn't realize it at first. With more time and experience, the full picture has become clear (the grass isn't greener, at least for him), and he has no plans to make that career change. This anecdote is meant to illustrate that it is inherently difficult to create a complete pro-con list when your experience is limited. Other mental models in this chapter will help you approach situations like these with more objectivity and skepticism, so you can uncover the complete picture faster and make sense of what to do about it.

You've probably heard the phrase If all you have is a hammer, everything looks like a nail. This phrase is called Maslow's hammer and is derived from this longer passage by psychologist Abraham Maslow in his 1966 book The Psychology of Science:

I remember seeing an elaborate and complicated automatic washing machine for automobiles that did a beautiful job of washing them. But it could do only that, and everything else that got into its clutches was treated as if it were an automobile to be washed. I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

The hammer of decision-making models is the pro-con list; useful in some instances, but not the optimal tool for every decision. Luckily, there are other decision-making models to help you efficiently discover and evaluate your options and their consequences across a variety of situations. As some decisions are complex and consequential, they demand more complicated mental models. In simpler cases, applying these sophisticated models would be overkill. It is best, however, to be aware of the range of mental models available so that you can pick the right tool for any situation.

WEIGHING THE COSTS AND BENEFITS

One simple approach to improving the pro-con list is to add some numbers to it. Go through each of your pros and cons and put a score of −10 to 10 next to it, indicating how much that item is worth to you relative to the others (negatives for cons and positives for pros). When considering a new job, perhaps location is much more important to you than a salary adjustment? If so, location would get a higher score. Scoring in this way helps you overcome some of the pro-con list deficiencies. Now each item isn't treated equally anymore. You can also group multiple items together into one score if they are interrelated. And you can now more easily compare multiple options: simply add up all the pros and cons for each option (e.g., job offers) and see which one comes out on top.

This method is a simple type of cost-benefit analysis, a natural extension of the pro-con list that works well as a drop-in replacement in many situations. This powerful mental model helps you more systematically and quantitatively analyze the benefits (pros) and costs (cons) across an array of options.
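As a minimal sketch of the scoring method (the options, items, and weights here are invented purely for illustration):

```python
# Each option gets pros (+) and cons (-) scored from -10 to 10;
# the option with the highest total comes out on top.
options = {
    "stay at current job": {"familiar team": +4, "no relocation": +6, "flat salary": -5},
    "new job offer": {"better pay": +7, "long commute": -8, "new skills": +5},
}

for name, items in options.items():
    print(f"{name}: {sum(items.values())}")
# stay at current job: 5
# new job offer: 4
```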
For simple situations, the scoring approach just outlined works well. In the rest of this section, we explain how to think about cost-benefit analysis in more complicated situations, introducing a few other mental models you will need to do so. Even if you don't use sophisticated cost-benefit analysis yourself, you will want to understand how it works because this method is often used by governments and organizations to make critical decisions. (Math warning: because numbers are involved, there is a bit of arithmetic needed.)

The first change when you get more sophisticated is that instead of putting relative scores next to each item (e.g., −10 to 10), you start by putting explicit dollar values next to them (e.g., −$100, +$5,000, etc.). Now when you add up the costs and benefits, you will end up with an estimate of that option's worth to you in dollars. For example, when considering the option of buying a house, you would start by writing down what you would need to pay out now (your down payment, inspection, closing costs), what you would expect to pay over time (your mortgage payments, home improvements, taxes... the list goes on), and what you expect to get back when you sell the house. When you add those together, you can estimate how much you stand to gain (or lose) in the long term.

As with pro-con lists, it is still hard to account for every cost and benefit in a cost-benefit analysis. However, it is important to note that this model works well only if you are thorough, because you will use that final number to make decisions. One useful tactic is to talk to people who have made similar decisions and ask them to point out costs or benefits that you may have missed. For instance, by talking to other homeowners, you might learn about maintenance costs you didn't fully consider (like how often things break, removing dead trees, etc.). Longtime homeowners can easily rattle off this hidden litany of costs (said with experience!).

When writing down costs and benefits, you will find that some are intangible. Continuing the house example, when you buy a house, you might have some anxiety around keeping it up to date, and that anxiety can be an additional "cost." Conversely, there may be intangible benefits to owning a home, such as not having to deal with a landlord. In a cost-benefit analysis, when faced with intangibles like these, you still want to assign dollar values to them, even if they are just rough estimates of how much they are worth to you. Doing so will help you create a fair quantitative comparison between the courses of action you are considering.

Writing down dollar values for intangible costs and benefits can seem strange—how do you know what it's worth to you to not have to deal with a landlord? But if you think about it, this is no different than scoring a pro-con list. In the scoring method, if the extra amount you'd have to pay monthly rated a −10 (out of 10) and landlord avoidance rated a +1 (out of 10), then you have a quick way to start an estimate: just take the extra payment amount and divide it by 10. Say the excess monthly payments are expected to be $1,000 per month; then you could estimate it is worth $100 per month to avoid a landlord. Of course, you can pick any numbers that make sense to you. You can get hung up here because it can feel arbitrary to write down specific values for things that you don't know exactly. However, you should know that doing so truly helps your analysis.
The reason is that you really do have some sense for how valuable things are, and putting that (even inexact) sense into your analysis will improve your results. And, as we will see in a moment, there is a method for testing how much these values are influencing your results.

So far, you've moved from scoring to dollar values. Next, you graduate to a spreadsheet! Instead of a column of costs and a column of benefits, now you want to arrange the costs and benefits on a timeline. Give each item its own row, and each column in the timeline will now list the cost or benefit created by that item in a given year. So, the first column holds all the costs and benefits you expect this year (in year 0), the next column in year 1, then year 2, and so on. The row for a $2,000-per-month mortgage payment would look like −$24,000, −$24,000, −$24,000, for as many years as the life of the mortgage.

The reason it is important to lay out the costs and benefits over time in this manner (in addition to increased clarity) is that benefits you get today are worth more than those same benefits later. There are three reasons for this that are important to appreciate, so please excuse the tangent; back to the cost-benefit analysis in a minute.

First, if you receive money (or another benefit) today, you can use it immediately. This opens up opportunities for you that you wouldn't otherwise have. For instance, you could invest those funds right now and be receiving a return on that money via a different investment, or you could use the funds for additional education, investing in yourself. (See opportunity cost of capital in Chapter 3.)

Second, most economies have some level of inflation, which describes how, over time, prices tend to increase, or inflate. As a result, your money will have less purchasing power in the future than it does today. When we were younger, the standard price for a slice of pizza was one dollar; now a slice will run you upward of three dollars! That's inflation. Because of inflation, if you get one hundred dollars ten years from now, you won't be able to buy as much as you could if you had the same amount of money today. Consequently, you don't want to regard an amount of money in ten years as the equivalent amount of money available today.

Third, the future is uncertain, and so there is risk that your predicted benefits and costs will change. For instance, benefits that depend on currencies, stock markets, and interest rates will fluctuate in value, and the further you go into the future, the harder they are to predict.

Now back to cost-benefit analysis. As you recall, you have a spreadsheet that lays out current and future costs and benefits across time. To account for the differences in value between current and future benefits, you use a mental model we introduced back in Chapter 3: the discount rate. You simply discount future benefits (and costs) when comparing them to today. Let's walk through an example to show you how it works.

Cost-benefit analysis is arguably most straightforward with simple investments, so let's use one. Bonds are a common investment option, which operate like a loan: you invest (loan) money today and expect to get back more money in the future when the bond matures (is due). Suppose you invest $50,000 in a bond, which you expect to return $100,000 in ten years. Feel free to make a spreadsheet and follow along.

Cost-Benefit Analysis Timeline

                  Year 0      Year 1   Year 2   Year 3   Year 4   ...   Year 10
Costs             $(50,000)   —        —        —        —        ...   —
Benefits          —           —        —        —        —        ...   $100,000
Discounted (6%)                                                         $55,839
Net benefit       $5,839
The only cost today (year 0) is $50,000, to purchase the bond. The only benefit in the future (year 10) is $100,000, what you get back when the bond matures. However, as noted, that benefit is not actually worth $100,000 in today's dollars. You need to discount this future benefit back to what it is worth today. Using a discount rate of 6 percent (relatively appropriate for this situation—more on that in a bit), you can use a net present value calculation (again see Chapter 3 if you need a refresher) to translate the benefit of $100,000 in ten years into today's dollars given the 6 percent discount rate. The formula is $100,000/1.06^10, and you get the result of $55,839.

That's all you need for a relatively sophisticated cost-benefit analysis right there! To finish the analysis, just add up all the discounted costs and benefits in today's dollars. You have the discounted benefit of $55,839 minus the initial cost of $50,000, netting you $5,839. You want the net benefit to be positive or else the deal isn't worth doing, since you'd end up worse off (in today's dollars). In this case, the net benefit is positive, so the investment is worth considering among your other options.

A central challenge with cost-benefit analysis is that this end result is sensitive to the chosen discount rate. One way to show this sensitivity is through a sensitivity analysis, which is a useful method to analyze how sensitive a model is to its input parameters. Using the $50,000 bond example, let's run a sensitivity analysis on the discount rate. To do so, you just vary the discount rate and calculate the net benefit for each variation.

Sensitivity Analysis

Discount rate   Net benefit
0%              $50,000
2%              $32,033
4%              $17,556
6%              $5,839
8%              −$3,680
10%             −$11,446
12%             −$17,803
14%             −$23,026
16%             −$27,332

Notice how a seemingly small difference in the discount rate can represent a huge difference in the net benefit. That is, the net benefit is very sensitive to the discount rate. While the net benefit is positive at a 6 percent discount rate, it is three times more positive at 4 percent, and it is negative at 8 percent. That's because at higher discount rates, the future benefit is discounted more. Eventually, it is discounted so much that the net benefit drops into negative territory. Running a sensitivity analysis like this can give you an idea of a range of net benefits you can expect under reasonable discount rates.

You should similarly run a sensitivity analysis on any input parameter about which you are uncertain so that you can tell how much it influences the outcome. Recall how earlier we discussed the difficulties around putting dollar values to intangible costs and benefits, such as how much not having a landlord is worth. You could use sensitivity analysis to test how much that input parameter matters to the outcome, and how a range of reasonable values would directly influence the outcome. In general, sensitivity analysis can help you quickly uncover the key drivers in your spreadsheet inputs and show you where you may need to spend more time to develop higher accuracy in your assumptions. Sensitivity analysis is also common in statistics, and we actually already presented another one in Chapter 5 when we showcased how sample size is sensitive to alpha and beta when designing experiments.

Given that the discount rate is always a key driver in cost-benefit analyses, figuring out a reasonable range for the discount rate is paramount.
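Both the $5,839 result and the sensitivity table can be reproduced in a few lines; net_benefit below is an illustrative helper, not a standard library function:

```python
def net_benefit(cost_today: float, payoff: float, years: int, rate: float) -> float:
    """Net present value: discount the future payoff back to today's
    dollars, then subtract what you pay today."""
    return payoff / (1 + rate) ** years - cost_today

# The $50,000 bond returning $100,000 in ten years, at a 6% discount rate:
print(round(net_benefit(50_000, 100_000, 10, 0.06)))  # ~5,839

# Sensitivity analysis: vary the one uncertain input, watch the output swing.
for rate in [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]:
    print(f"{rate:>4.0%}  {net_benefit(50_000, 100_000, 10, rate):>10,.0f}")
# Matches the table above to within a dollar or two of rounding.
```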
Given that the discount rate is always a key driver in cost-benefit analyses, figuring out a reasonable range for the discount rate is paramount. To do so, consider again the factors that underlie the discount rate: inflation (that the purchasing power of money can change over time), uncertainty (that benefits may or may not actually occur), and opportunity cost of capital (that you could do other things with your money). Since these factors are situationally dependent, there is unfortunately no standard answer for what discount rate to use for any given situation.

Governments typically use rates close to their interest rates, which normally move with inflation rates. Large corporations use sophisticated methods that account for their rates of borrowing money and the return on investment seen from previous projects, together resulting in a rate that is usually significantly higher than government interest rates. New businesses, which are highly speculative, should be using much higher discount rates still, since it costs them a lot to borrow money and they are often in a race against time before they run out of money or get eaten by competitors. Thus, the range of acceptable rates can vary widely, from close to the inflation rate all the way up to 50 percent or higher in an extremely high-risk/high-reward situation.

One decent approach is to use the rate at which you can borrow money. You would want your investment returns to be higher than this rate, or else you shouldn’t be borrowing money to invest. Note that this rate would typically have the inflation rate already built into it, since credit rates move with interest rates, which typically move with inflation. That is, people loaning you money also want to be protected from inflation, and so they usually build an expected inflation rate into their lending rates.

As investments can look very different based on different discount rates, there are many open debates about which discount rates are most appropriate to use in differing situations, especially when it comes to government programs. Different discount rates can favor one program over another, and so there can be a lot of pressure from different lobbying groups to choose a particular rate. Another problem occurs in situations where the costs or benefits are expected to persist far into the future, such as with climate change mitigation. Because the effects of the discount rate compound over time, even rather small rates discount far-future effects close to zero. This has the effect of not valuing the consequences to future generations, and some economists think that is unfair and potentially immoral.

Even with this central issue around the discount rate, cost-benefit analysis is an incredibly valuable model to frame a more quantitative discussion around how to proceed with a decision. As such, many governments mandate its use when evaluating policy options. In 1981, U.S. President Ronald Reagan signed Executive Order 12291, which mandated that “regulatory action shall not be undertaken unless the potential benefits to society from the regulation outweigh the potential costs to society.” This language has been tweaked by subsequent U.S. presidents, though its central idea continues to drive policy, with the U.S. federal government conducting cost-benefit analyses for most significant proposed regulatory actions.

One final issue with cost-benefit analysis to keep in mind is the trickiness of comparing two options that have different time horizons. To illustrate this trap, let’s compare our theoretical bond investment from earlier to another bond investment.
Our bond investment from before cost $50,000 and returned $100,000 in ten years, which at a 6 percent discount rate resulted in a net benefit in today’s dollars of $5,839. Our new investment will also be a $50,000 bond investment, though instead of returning $100,000 in ten years, it pays back $75,000 in just six years. The cost today (year 0) for this second bond is again −$50,000. Using the same 6 percent discount rate, the $75,000 benefit six years from now discounted back to today’s dollars would be worth $52,872 ($75,000/1.06^6), for a net benefit of $2,872 ($52,872 − $50,000).

This net benefit is less than the first bond investment opportunity’s net benefit of $5,839, and so it seems the first bond is the better investment. However, if you purchased the second bond, your $75,000 would be freed up after six years, leaving you four more years to invest that money in another way. If you were able to invest that money in a new investment at a high enough rate, this second bond is potentially more attractive in the end. When making a comparison, you therefore must consider what could happen over the same time frame.

In other words, cost-benefit analysis is only as good as the numbers you put into it. In computer science, there is a model describing this phenomenon: garbage in, garbage out. If your estimates of costs and benefits are highly inaccurate, your timelines don’t line up, or your discount rate is poorly reasoned (garbage in), then your net result will be similarly flawed (garbage out). On the other hand, if you take great care to make accurate estimates and perform relevant sensitivity analyses, then cost-benefit analysis can be a first-rate model for framing a decision-making process, and is in most cases a desirable replacement for a pro-con list. Next time you make a pro-con list, at least consider the scoring method to turn it into a simple cost-benefit analysis.

TAMING COMPLEXITY

When you can list out your options for a decision, and their costs and benefits are relatively clear, then cost-benefit analysis is a good starting point for approaching the decision. However, in many cases, your options and their associated costs and benefits are not very clear. Sometimes there is too much uncertainty in potential outcomes; other times, the situation can be so complex that it becomes difficult even to understand your options in the first place. In either case, you’ll need to use some other mental models to navigate such complexity.

Consider a relatively common situation that homeowners face: the expensive repair. Suppose you want to repair your pool equipment before the summer swimming season. You get bids from two contractors. One bid is from your usual dependable pool service, but it seems high at $2,500. The second bid comes in at a lower cost of $2,000, though this contractor is a team of one, you don’t have a history with them, and they also seem like they might be a little out of their depth. As such, you get the impression that there is only a 50 percent chance that this contractor will finish at the quoted cost in a timely manner (in one week).
If not, you estimate the following scenarios:

- A 25 percent chance that they will be one week late, at an extra cost of $250 for the extra labor
- A 20 percent chance that they will be two weeks late, at an extra cost of $500
- A 5 percent chance that they will not only take longer than three weeks to complete the job, but also that some of their work will need to be redone, totaling extra costs of $1,000

This situation (multiple bids with timing/quality concerns) is very common, but because of the uncertainty introduced in the outcome, it’s a bit too complex to analyze easily with just cost-benefit analysis. Luckily, there is another straightforward mental model you can use to make sense of all these potential outcomes: the decision tree. It’s a diagram that looks like a tree (drawn on its side) and helps you analyze decisions with uncertain outcomes. The branch points (often denoted by squares) are decision points, and the leaves represent different possible outcomes (often using open circles to denote chance points). A decision tree that represents this pool situation could look like the figure below.

Decision Tree

The first square represents your choice between the two contractors, and then the open circles further branch out to the different possible outcomes for each of those choices. The leaves with the closed circles list the resulting costs for each outcome, and their probabilities are listed on each line. (This is a simple probability distribution [see Chapter 5], which describes how all the probabilities are distributed across the possible outcomes. Each group of probabilities sums to 100 percent, representing all the possible outcomes for that choice.)

You can now use your probability estimates to get an expected value for each contractor, by multiplying each potential outcome’s probability by its cost, and then summing them all up. This resulting summed value is what you would expect to pay on average for each contractor, given all the potential outcomes. The expected value for your usual contractor (Contractor 2 in the decision tree) is just $2,500, since there is only one possible outcome. The expected value for the new contractor (Contractor 1 in the decision tree) is the sum of the multiplications across their four possible outcomes: $1,000 + $562.50 + $500 + $150 = $2,212.50. Even though the new contractor has an outcome that might cost you $3,000, the expected value you’d pay is still less than you’d pay your usual contractor.

Expected Value

What this means is that if these probabilities are accurate, and you could run the scenario one hundred times in the real world where you pick the new contractor each time, your average payment to them would be expected to be $2,212.50. That’s because half the time you’d pay only $2,000, and the other half, more. You’d never pay exactly $2,212.50, since that isn’t a possible outcome, but overall your payments would average out to that expected value over many iterations.

If you find this confusing, the following example might be helpful. In 2015, U.S. mothers had 2.4 kids on average. Does any particular mother have exactly 2.4 kids? We hope not. Some have one child, some two, some three, and so on, and it all averages out to 2.4. Likewise, the various contractor payment outcomes and their probabilities add up to the expected value payment amount, even though you never pay that exact amount.
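In code, this expected-value arithmetic is short. Here is a minimal Python sketch (our own construction, using the probabilities and costs from the example) that reproduces both contractors’ numbers:

```python
# Expected value of each branch: sum of probability x cost over its outcomes.
def expected_value(outcomes):
    """outcomes is a list of (probability, cost) pairs that sum to 100%."""
    return sum(p * cost for p, cost in outcomes)

new_contractor = [
    (0.50, 2_000),  # finishes on time at the quoted cost
    (0.25, 2_250),  # one week late: an extra $250 of labor
    (0.20, 2_500),  # two weeks late: an extra $500
    (0.05, 3_000),  # three-plus weeks late plus rework: an extra $1,000
]
usual_contractor = [(1.00, 2_500)]  # a single certain outcome

print(expected_value(new_contractor))    # 2212.5
print(expected_value(usual_contractor))  # 2500.0
```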
In any case, from this lens of the decision tree and the resulting expected values, you might rationally choose the new contractor, even with all their potential issues. That’s because your expected outlay is lower with that contractor. Of course, this result could change with different probabilities and/or potential outcome payments. For example, if you thought that, instead of a 5 percent chance of a $3,000 bill, there was a 50 percent chance you could end up in this highest outcome, then the expected value for the new contractor would become higher than your usual contractor’s bid. Remember that you can always run a sensitivity analysis on any inputs that you think might significantly influence the decision, as we discussed in the last section. Here you would vary the probabilities and/or potential outcome payments and see how the expected values change accordingly.

Additionally, consider another way the decision could change. Suppose you’ve already scheduled a pool party a few weeks out. Now, if the lower-bid contractor pushes into that second week, you’re going to be faced with a lot of anxiety about your party. You will have to put pressure on the contractor to get the job done, and you might even have to bring in reinforcements to help finish the job at a much higher cost. That’s a lot of extra hassle. To a wealthier person who associates a high opportunity cost with their time, all this extra anxiety and hassle may be valued at an extra $1,000 worth of cost, even if you aren’t paying that $1,000 directly to the contractor. Accounting for this possible extra burden would move the two-week-late outcome up from $2,500 (previously a $500 overrun) to $3,500 (now a $1,500 overrun). Similarly, if this new contractor really messes up the job and you do have to bring in your regular contractor to do most everything over again on short notice, it will cost you the extra $1,000 in anxiety and hassle, as well as literally more payment to the other contractor. So, that small 5 percent chance of a $3,000 outcome might end up costing the equivalent of an extra $2,000, moving it to $5,000 in total.

By using these increased values in your decision tree, you can effectively “price in” the extra costs. Because these new values include more than the exact cost you’d have to pay out, they are called utility values, which reflect your total relative preferences across the various scenarios. We already saw this idea in the last section when we discussed putting a price on the preference of not having a landlord. This is the mental model that encapsulates that concept. Utility values can be disconnected from actual prices in that you can value something more than something else, even though the two cost the same on the open market. Think about your favorite band—it’s worth more to you to see them in concert than another band that offers concerts at the same price, simply because you like them more. You would get more utility out of that concert because of your preference. In the pool case, the stress involved with scrambling to fix the pool before your party is an extra cost of lost utility in addition to the actual cost you would have to pay out to the contractors.

In terms of the decision tree, the outcome values for the leaves can become the utility values, incorporating all the costs and benefits (tangible and intangible) into one number for each possible outcome. If you do that, then the conclusion flips: the decision is now to use your usual contractor (Contractor 2 in the decision tree below).
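To see the flip numerically, you can rerun the expected-value arithmetic with the utility-adjusted outcomes; the $3,500 and $5,000 figures come from the discussion above, while the sketch itself is ours:

```python
# The new contractor's branch again, but the two worst outcomes now carry
# utility values that price in the party-related anxiety and hassle.
outcomes = [
    (0.50, 2_000),  # on time at the quoted cost
    (0.25, 2_250),  # one week late
    (0.20, 3_500),  # two weeks late: $2,500 plus $1,000 of hassle
    (0.05, 5_000),  # botched job: $3,000 plus an extra $2,000 equivalent
]

print(sum(p * cost for p, cost in outcomes))  # 2512.5
```

The utility-adjusted expected value of $2,512.50 now sits just above your usual contractor’s flat $2,500, producing the flipped decision.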
Utility Values

However, note that it is still a really close decision, as both contractors now have almost the same expected value! This closeness illustrates the power of probabilistic outcomes. Even though the new contractor is now associated with much higher potential “costs,” 50 percent of the time you’d still expect to pay them a much smaller amount. This lower cost drives the expected value down a lot because it happens so frequently.

Just as in cost-benefit analysis and scoring pro-con lists, we recommend using utility values whenever possible because they paint a fuller picture of your underlying preferences, and therefore should result in more satisfactory decisions. In fact, more broadly, there is a philosophy called utilitarianism that expresses the view that the most ethical decision is the one that creates the most utility for all involved. Utilitarianism as a philosophy has various drawbacks, though. Primarily, decisions involving multiple people that increase overall utility can seem quite unfair when that utility is not equally distributed among the people involved (e.g., income inequality despite rising standards of living). Also, utility values can be hard to estimate. Nevertheless, utilitarianism is a useful philosophical model to be aware of, if only to consider what decision would increase overall utility the most.

In any case, decision trees will help you start to make sense of what to do in situations with an array of diverse, probabilistic outcomes. Think about health insurance—should you go for a higher-deductible plan with lower payments or a lower-deductible plan with higher payments? It depends on your expected level of care, and whether you can afford the lower-probability scenario where you will need to pay out a high deductible. (Note that the answer isn’t obvious, because with the lower-deductible plan you are making higher monthly premium payments. This increase in premiums could be viewed as paying out a portion of your deductible each month.) You can examine this scenario and others like it via a decision tree, accounting for your preferences along with the actual costs.

Decision trees are especially useful to help you think about unlikely but severely impactful events. Consider more closely the scenario where you have a medical incident that requires you to pay out your full deductible. For some people, that amount of outlay could equate to bankruptcy, and so the true cost of this event occurring to them is much, much higher than the actual cost of the deductible. As a result, if you were in this situation, you would want to make the loss in utility value for this scenario extremely high to reflect your desire to avoid bankruptcy. Doing so would likely push you into a higher-premium plan with a lower deductible (that you can still afford), and more assurance that you would avoid bankruptcy. In other words, if there is a chance of financial ruin, you might want to avoid that plan even though on average it would lead to a better financial outcome.
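As a rough sketch of that trade-off (with entirely made-up premiums, deductibles, and claim probabilities; none of these figures come from the text), you can compare each plan’s expected annual cost against its worst case:

```python
# plan name -> (annual premium, deductible, chance of a claim that maxes it out)
plans = {
    "high-deductible": (1_200, 6_000, 0.10),
    "low-deductible": (3_000, 1_000, 0.10),
}

for name, (premium, deductible, p_claim) in plans.items():
    expected = premium + p_claim * deductible  # average cost in a typical year
    worst = premium + deductible               # the year everything goes wrong
    print(f"{name}: expected ${expected:,.0f}/yr, worst case ${worst:,.0f}/yr")
```

Under these assumptions the high-deductible plan wins on expected cost ($1,800 versus $3,100 a year), but if its $7,200 worst case would mean ruin, assigning that outcome an extremely negative utility flips the choice, exactly as described above.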
One thing to watch out for in this type of analysis is the possibility of black swan events, which are extreme, consequential events (that end in things like financial ruin), but which have significantly higher probabilities than you might initially expect. The name is derived from the false belief, held for many centuries in Europe and other places, that black swans did not exist, when in fact they were (and still are) common birds in Australia.

As applied to decision tree analysis, a conservative approach would be to increase your probability estimates of low-probability but highly impactful scenarios like the bankruptcy one. This revision would account for the fact that the scenario might represent a black swan event, and that you might therefore be wrong about its probability.

One reason that the probability of black swan events may be miscalculated relates to the normal distribution (see Chapter 5), which is the bell-curve-shaped probability distribution that explains the frequency of many natural phenomena (e.g., people’s heights). In a normal distribution, rare events occur on the tails of the distribution (e.g., really tall or short people), far from the middle of the bell curve. Black swan events, though, often come from fat-tailed distributions, which literally have fatter tails, meaning that events way out from the middle have a much higher probability of occurring when compared with a normal distribution.

Fat-Tailed Distribution

There are many naturally occurring fat-tailed distributions, and sometimes people just incorrectly assume they are dealing with a normal distribution when in fact they are dealing with a distribution with a fatter tail, which means that events in the tail occur with higher probability. In practice, these are distributions where some of the biggest outliers happen more often than you would expect from a normal distribution, as occurs with insurance payouts or in the U.S. income distribution (see the histogram in Chapter 5).

Another reason why you might miscalculate the probability of a black swan event is that you misunderstand the reasons for its occurrence. This can happen when you think a situation should come from one distribution, when multiple distributions are really involved. For example, there are genetic reasons (e.g., dwarfism and Marfan syndrome) why there might be many more shorter or taller people than you would expect from just a regular normal distribution, which doesn’t account for these rarer genetic variations.

A third reason is that you may underestimate the possibility and impact of cascading failures (see Chapter 4). As you recall, in a cascading-failure scenario, parts of the system are correlated: if one part falters, the next part falters, and so on. The 2007–2008 financial crisis is an example, where the failure of mortgage-backed securities cascaded all the way to the banks and associated insurance companies. Our climate presents another example. The term one-hundred-year flood denotes a flood that has a 1 percent chance of occurring in any given year. Unfortunately, climate change is raising the probability of the occurrence of what was once considered a one-hundred-year flood, and it no longer has a 1 percent chance in many areas. The dice are loaded. Houston, Texas, for example, has had three so-called five-hundred-year floods in the last three years! The probabilities of these events clearly need to be adjusted as the cascading effects of climate change continue to unfold.
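The loaded dice are easy to quantify. In this sketch of ours, the chance of seeing at least one such flood over a horizon is one minus the chance of dodging it every single year; the 1 percent figure comes from the definition above, while the 3 percent “loaded” rate is purely an illustrative assumption:

```python
# Chance of at least one flood over a horizon = 1 - (no flood every year).
def at_least_one(p_per_year, years):
    return 1 - (1 - p_per_year) ** years

print(f"{at_least_one(0.01, 30):.0%}")  # a true 1%-a-year flood: ~26% over 30 years
print(f"{at_least_one(0.03, 30):.0%}")  # loaded dice at 3% a year: ~60% over 30 years
```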
To better determine the outcome probabilities in highly complex systems like banking or climate, you may first have to take a step back and try to make sense of the whole system before you can even try to create a decision tree or cost-benefit analysis for a particular subset or situation. Systems thinking describes this act, when you attempt to think about the entire system at once. By thinking about the overall system, you are more likely to understand and account for subtle interactions between components that could otherwise lead to unintended consequences from your decisions. For example, when thinking about making an investment, you might start to appreciate how seemingly unrelated parts of the economy might affect its outcome.

Some systems are fairly simple, and you can picture the whole system in your head. Others are so complex that it is too challenging to hold all the interlocking pieces in your head simultaneously. One solution is literally to diagram the system visually. Drawing diagrams can help you get a better sense of complex systems and how the parts of the system interact with one another. Techniques for effectively diagramming complex systems are beyond the scope of this book, but know that there are many techniques you can learn, including causal loop diagrams (which showcase feedback loops in a system) and stock and flow diagrams (which showcase how things accumulate and flow in a system).

Gabriel’s master’s thesis involved diagramming the email spam system. The picture below is one of his causal loop diagrams—you aren’t meant to understand this diagram; it’s just an example of what these things can end up looking like. Just know that it was really helpful in gaining a much better understanding of this complex system.

Email Spam Causal Loop Diagram

As a further step, you can use software to imitate the system, called a simulation. In fact, software exists that allows you to compose a diagram of a system on your screen and then immediately turn it into a working simulation. (Two such programs that do this online are Insight Maker and True-World.) In the process, you can set initial conditions and then see how the system unfolds over time. Simulations help you more deeply understand a complex system and lead to better predictions of black swans and other events.

Simulations can also help you identify how a system will adjust when faced with changing conditions. Le Chatelier’s principle, named after French chemist Henri-Louis Le Chatelier, states that when any chemical system at equilibrium is subject to a change in conditions, such as a shift in temperature, volume, or pressure, it readjusts itself into a new equilibrium state and usually partially counteracts the change. For example, if someone hands you a box to carry, you don’t immediately topple over; you instead shift your weight distribution to account for the new weight. Or in economics, if a new tax is introduced, tax revenues from that tax end up being lower in the long run than one would expect under current conditions, because people adjust their behavior to avoid the tax.

If this sounds like a familiar concept, it’s because Le Chatelier’s principle is similar to the mental model homeostasis (see Chapter 4), which comes from biology: recall how your body automatically shivers and sweats in response to external conditions in order to regulate its internal temperature. Le Chatelier’s principle doesn’t necessarily mean the system will regulate around a predetermined value, but that it will react to externally imposed conditions, usually in a way that partially counteracts the external stimulus. You can see the principle in action in real time with simulations, because they allow you to calculate how your simulated system will adjust to various changes.
A related mental model that also arises in dynamic systems and simulations is hysteresis, which describes how a system’s current state can be dependent on its history. Hysteresis is also a naturally occurring phenomenon, with examples across most scientific disciplines. In physics, when you magnetize a material in one direction, such as by holding a magnet to another piece of metal, the metal does not fully demagnetize after you remove the magnet. In biology, the T cells that help power your immune system, once activated, thereafter require a lower threshold to reactivate. Hysteresis describes how both the metal and the T cells partially remember their states, such that what happened previously can impact what will happen next.

Again, this may already seem like a familiar concept, because it is similar to the mental model of path dependence (see Chapter 2), which more generally describes how choices have consequences in terms of limiting what you can do in the future. Hysteresis is one type of path dependence, as applied to systems.

In engineering systems, for example, it is useful to build some hysteresis into the system to avoid rapid changes. Modern thermostats do this by allowing for a range of temperatures around the set point: if you want to maintain 70 degrees Fahrenheit, a thermostat might be set to turn the heater on when the temperature drops to 68 degrees and back off when it hits 72 degrees. In this way, it isn’t kicking on and off constantly. Similarly, on websites, designers and developers often build in a lag for when you move your mouse off page elements like menus. They build their programs to remember that you were on the menu, so that when you move off, it doesn’t abruptly go away, which can appear jarring to the eye.

You can use all these mental models around visualizing complex systems and simulating them to help you better assess potential outcomes and their associated probabilities. Then you can feed these results into a more straightforward decision model like a decision tree or cost-benefit analysis.

A particular type of simulation that can be especially useful in this way is a Monte Carlo simulation. Like critical mass (see Chapter 4), this is a model that emerged during the Manhattan Project in Los Alamos in the run-up to the discovery of the atomic bomb. Physicist Stanislaw Ulam was struggling to use traditional mathematics to determine how far neutrons would travel through various materials, and came up with this new method after playing solitaire (yes, the card game). In his words, quoted in Los Alamos Science:

The first thoughts and attempts I made to practice [the Monte Carlo method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than “abstract thinking” might not be to lay it out say one hundred times and simply observe and count the number of successful plays.

A Monte Carlo simulation is actually many simulations run independently, with random initial conditions or other uses of random numbers within the simulation itself. By running a simulation of a system many times, you can begin to understand how probable different outcomes really are. Think of it as a dynamic sensitivity analysis.
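As a minimal illustration (our own sketch, reusing the contractor numbers from earlier rather than anything from Los Alamos), you can recover a decision tree’s expected value by simulating the decision many times and averaging what happens:

```python
import random

# Monte Carlo version of the contractor decision tree: instead of computing
# the expected value analytically, draw an outcome at random many times and
# average the simulated payments. Probabilities and costs are from the example.
costs = [2_000, 2_250, 2_500, 3_000]
weights = [0.50, 0.25, 0.20, 0.05]

trials = 100_000
total = sum(random.choices(costs, weights=weights, k=trials))
print(f"Average over {trials:,} simulated hires: ${total / trials:,.2f}")
# Converges on the analytic expected value of $2,212.50.
```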
Monte Carlo simulations are used in nearly every branch of science, but they are useful outside science as well. For example, venture capitalists often use Monte Carlo simulations to determine how much capital to reserve for future financings. When a venture fund invests in a company, that company, if successful, will probably raise more money in the future, and the fund will often want to participate in some of those future financings to maintain its ownership percentage. How much money should it reserve for a company? Not all companies are successful, and different companies raise different amounts, so the answer is not straightforward at the time of the initial investment. Many funds use Monte Carlo simulations to understand how much they ought to reserve, given their current fund history and their estimates of company success and the size of potential financings.

More generally, making the effort to understand complex systems better through systems thinking—whether by using diagrams, running simulations, or employing other mental models—not only helps you get a broad picture of the system and its range of outcomes, but also can help you become aware of the best possible outcomes. Without such knowledge, you can get stuck chasing a local optimum solution, which is an admittedly good solution, but not the best one. If you can, you want to work toward that best solution, which would be the global optimum. Think of rolling hills: the top of a nice nearby hill would be a good success (local optimum), though in the distance there is a much bigger hill that would be a much better success (global optimum). You want to be on that bigger hill. But first you have to have a full view of the system to know the bigger hill exists.

Local vs. Global Optimum

BEWARE OF UNKNOWN UNKNOWNS

In 1955, psychologists Joseph Luft and Harrington Ingham originated the concept of unknown unknowns, which was made popular by former U.S. Secretary of Defense Donald Rumsfeld at a news briefing on February 12, 2002, with this exchange:

Jim Miklaszewski: In regard to Iraq, weapons of mass destruction, and terrorists, is there any evidence to indicate that Iraq has attempted to or is willing to supply terrorists with weapons of mass destruction? Because there are reports that there is no evidence of a direct link between Baghdad and some of these terrorist organizations.

Rumsfeld: Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.

The context and evasiveness of the exchange aside, the underlying model is useful in decision making. When faced with a decision, you can use a handy 2 × 2 matrix (see Chapter 4) as a starting point to envision these four categories of things you know and don’t know.

Knowns & Unknowns

             Known                           Unknown
  Known      What you know you know          What you know you don’t know
  Unknown    What you don’t know you know    What you don’t know you don’t know

This model is particularly effective when thinking more systematically about risks, such as risks to a project’s success.
Each category deserves its own attention and process:

Known knowns: These might be risks to someone else, but not to you, since you already know how to deal with them based on your previous experience. For example, a project might require a technological solution, but you already know what that solution is and how to implement it; you just need to execute that known plan.

Known unknowns: These are also known risks to the project, but because of some uncertainty, it isn’t exactly clear how they will be resolved. An example is the risk of relying on a third party: until you engage with them directly, it is unknown how they will react. You can turn some of these into known knowns by doing de-risking exercises (see Chapter 1), getting rid of the uncertainty.

Unknown knowns: These are the risks you’re not thinking about, but for which there exist clear mitigation plans. For example, your project might involve starting to do business in Europe over the summer, but you don’t yet know that Europeans don’t do much business in August. An adviser with more experience can help identify these risks from the start and turn them into known knowns. That way they will not take you by surprise later on and potentially throw off your project.

Unknown unknowns: These are the least obvious risks, which require a concerted effort to uncover. For example, maybe something elsewhere in the organization or in the industry could dramatically change this project (like budget cuts, an acquisition, or a new product announcement). Even if you identify an unknown unknown (turning it into a known unknown), you still remain unsure of its likelihood or consequences. You must then still do de-risking exercises to finally turn it into a known known.

As you can see, you enumerate items in each of the four categories and then work to make them all known knowns. This model is about giving yourself more complete knowledge of a situation. It’s similar to systems thinking, from the last section, in that you are attempting to get a full picture of the system so you can make better decisions.

As a personal example, consider having a new baby. From reading all the books, you know the first few weeks will be harrowing: you’ll want to take some time off work, you’ll need to buy a car seat, crib, diapers, etc.—these are the known knowns. You also know that how your baby might sleep and eat (or not) can be an issue, but until the baby is born, their proclivities remain uncertain—they are known unknowns. You might not yet know that swaddling a baby is a thing, but you’ll be shown how soon enough by a nurse or family member, turning this unknown known into a known known. And then there are things that no one knows yet or is even thinking about, such as whether your child could have a learning disability.

A related model that can help you uncover unknown unknowns is scenario analysis (also known as scenario planning), which is a method for thinking about possible futures more deeply. It gets its name because it involves analyzing different scenarios that might unfold. That sounds simple enough, but it is deceptively complicated in practice. That’s because thinking up possible future scenarios is a really challenging exercise, and thinking through their likelihoods and consequences is even more so.

Governments and large corporations have dedicated staff for scenario analysis. They are continually thinking up and writing reports about what the world could look like in the future and how their citizenry or shareholders might fare under those scenarios.
Many academics, especially in political science, urban planning, economics, and related fields, similarly engage in prognosticating about the future. And of course, science fiction is essentially an entire literary genre dedicated to scenario analysis.

To do scenario analysis well, you must conjure plausible yet distinct futures, ultimately considering several possible scenarios. This process is difficult because you tend to latch onto your first thoughts (see anchoring in Chapter 1), which usually depict a direct extrapolation of your current trajectory (the present), without challenging your own assumptions. One technique to ensure that you do challenge your assumptions is to list major events that could transpire (e.g., stock market crash, government regulation, major industry merger, etc.) and then trace their possible effects back to your situation. Some may have little to no effect, whereas others might form the basis for a scenario you should consider deeply.

Another technique for thinking more broadly about possible future scenarios is the thought experiment, literally an experiment that occurs just in your thoughts, i.e., not in the physical world. The most famous thought experiment is probably “Schrödinger’s cat,” named after Austrian physicist Erwin Schrödinger, who thought it up in 1935 to explore the implications of different interpretations of the physics of quantum mechanics. From his 1935 paper “The Present Situation in Quantum Mechanics”:

A cat is penned up in a steel chamber, along with the following device (which must be secured against direct interference by the cat): in a Geiger counter, there is a tiny bit of radioactive substance, so small, that perhaps in the course of the hour one of the atoms decays, but also, with equal probability, perhaps none; if it happens, the counter tube discharges and through a relay releases a hammer that shatters a small flask of hydrocyanic acid. If one has left this entire system to itself for an hour, one would say that the cat still lives if meanwhile no atom has decayed. The first atomic decay would have poisoned it.

So, you have a cat in a box, and if a radioactive atom decayed in the last hour, it would have killed the cat. This thought experiment poses some seemingly unanswerable questions: Until you observe the cat by opening the box, is it alive or dead, or in an in-between state, as certain interpretations of quantum mechanics would suggest? And what exactly happens when you open the box?

Schrödinger’s Cat Thought Experiment

Answers to this thought experiment are beyond the scope of this book and were argued over for decades after it was posed. Therein lies the power of the thought experiment.

Thought experiments are particularly useful in scenario analysis. Posing questions that start with “What would happen if...” is a good practice in this way: What would happen if life expectancy jumped forty years? What would happen if a well-funded competitor copied our product? What would happen if I switched careers?

These types of what-if questions can also be applied to the past, in what is called counterfactual thinking, which means thinking about the past by imagining that the past was different, counter to the facts of what actually occurred. You’ve probably seen this model in books and
