Chapter 6 Screening and Diagnostic Tests

Marcello Pagano

Summary

This chapter from Pagano's "Principles of Biostatistics" discusses screening and diagnostic tests in a medical context. It covers fundamental concepts like sensitivity, specificity, and Bayes' theorem, highlighting the applications of these tests in healthcare.

6 Screening and Diagnostic Tests

CONTENTS
6.1 Sensitivity and Specificity
6.2 Bayes’ Theorem
6.3 Likelihood Ratios
6.4 ROC Curves
6.5 Calculation of Prevalence
6.6 Varying Sensitivity
6.7 Further Applications
6.8 Review Exercises

A test is a tool used to determine the existence or nonexistence of some quality. In health, this framework is particularly useful when we are trying to establish whether an individual has or does not have a specified medical condition. In this case, we often use some sort of biological test. What all such tests have in common is that they are imperfect. As a consequence, if a test yields a positive reading indicating the existence of a condition or disease, or a negative reading indicating nonexistence of the condition, we do not know for certain that the test result is correct. Furthermore, we cannot identify the particular individuals for whom the result is wrong. What we can do is talk about the aggregate behavior of the test, and study the probabilities of both correct and incorrect outcomes. These probabilities can then be used to evaluate subsequent test results, and determine the appropriate action to be taken.

The aggregate behavior of medical tests is what we study in this chapter. We concentrate on two classes of tests: screening tests and diagnostic tests. Statistically they are very similar, but they do differ in how they are utilized.
Screening tests, as their name implies, are used to screen groups of individuals, who have not yet exhibited any clinical symptoms, for the existence of a condition or disease, in order to classify them with respect to their probability of having that condition. For example, we could measure the spread of SARS-CoV-2 in a population by applying a screening test to a sample of individuals from that population. When using a screening test, we expect most of the results to be negative. Screening is most often employed by health care professionals in situations where the early detection of disease would contribute to a more favorable prognosis for the individual or for the population in general.

Diagnostic tests, on the other hand, are used to confirm the diagnosis of a condition or disease. For example, we might test individuals who show symptoms of COVID-19, the disease caused by SARS-CoV-2 – such as a high temperature and/or breathing difficulties – for infection by the virus. Sometimes diagnostic tests are employed subsequent to a positive screening test; for example, a biopsy is often performed following a mammogram that is positive for breast cancer. Usually the proportion of individuals testing positive with a screening test is smaller than the proportion testing positive with a diagnostic test, because those being subjected to the diagnostic test are more likely to have the disease to begin with. This is certainly true in the early stages of an epidemic. For both types of tests, however, those who test positive are considered to be more likely to have the disease than those who test negative.

DOI: 10.1201/9780429340512-6

6.1 Sensitivity and Specificity

Suppose we are interested in two mutually exclusive and exhaustive states of health. We could define $D^+$ to be the event that an individual has a particular disease, and $D^-$ the complementary event that they do not have the disease.
Let $T^+$ represent a positive result on a test for the presence of disease, and $T^-$ a negative result. We are usually most interested in knowing $P(D^+ \mid T^+)$, the probability that a person with a positive test actually has the disease. Before we calculate this probability, however, we first consider the detection properties of the test being performed.

Cervical cancer is a disease for which the chance of containment is high given that it is discovered early. The Pap smear is a widely accepted screening test used to detect the abnormal growth of cells on the surface of the cervix in females who are as yet asymptomatic. It has been credited with being primarily responsible for the decreasing death rate due to cervical cancer. A large study conducted in Canada evaluated the performance of the Pap smear for detecting cervical intraepithelial neoplasia. The test was performed in groups of females with and without cervical cancer, as determined by colposcopy and biopsy.

Overall, 55.4% of the tests performed on females known to have cervical cancer resulted in positive outcomes. A true positive occurs when the test of an individual who has cancer of the cervix correctly indicates that she does. Therefore, in this study,

$$P(\text{test positive} \mid \text{disease}) = P(T^+ \mid D^+) = 0.554.$$

The probability of a positive test result given that the individual being tested does have the disease is called the sensitivity of the test. In this study, the sensitivity of the Pap smear was 0.554, or 55.4%.

In the group of females who have cervical cancer, testing positive and testing negative are mutually exclusive and exhaustive events. Therefore, the other 100% − 55.4% = 44.6% of females who had cervical cancer tested negative, and

$$P(\text{test negative} \mid \text{disease}) = P(T^- \mid D^+) = 1 - P(T^+ \mid D^+) = 1 - 0.554 = 0.446$$

is the probability of a false negative result.
In the group of females who do not have cervical cancer, 3.2% of the Pap smears resulted in false positive outcomes, where the test of an individual who does not have cancer incorrectly indicates that she does. This means that

$$P(\text{test positive} \mid \text{no disease}) = P(T^+ \mid D^-) = 0.032.$$

The specificity of a test is the probability that its result is negative given that the individual tested does not have the disease, or the probability of a true negative result. In this study, the specificity of the Pap smear was

$$P(\text{test negative} \mid \text{no disease}) = P(T^- \mid D^-) = 1 - P(T^+ \mid D^-) = 1 - 0.032 = 0.968,$$

or 96.8%. The possible results of the screening test are summarized in the following table.

Screening         True Disease Status
Test Result       Disease           No Disease
Test Positive     True positive     False positive
Test Negative     False negative    True negative

While in most instances we desire tests that have both high sensitivity and high specificity, this cannot always be achieved. In some circumstances, one criterion is considered more important than the other. When we are trying to identify cases of disease, for instance, we want the probability of a false negative result to be low, and the sensitivity to be high. When trying to control a contagious disease, false negatives might further promote its spread if the individuals with these erroneous test results behave as if they were uninfected. Furthermore, when diagnosing a serious disease where early diagnosis increases the probability of successful treatment, a false negative delays the initiation of that treatment. Alternatively, when trying to rule out people who do not have a disease, we want the probability of a false positive result to be low, and the specificity to be high. False positive results can lead to unnecessary – even invasive or dangerous – treatment, and can negatively impact the mental health of those affected.
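As a minimal sketch of the four conditional probabilities defined above, the following Python snippet derives them from a two-by-two table of counts. The counts are hypothetical, chosen only so that they reproduce the 55.4% and 3.2% figures quoted for the Pap smear study; the function name `detection_properties` is also an illustrative assumption.

```python
def detection_properties(true_pos, false_neg, false_pos, true_neg):
    """Return the four conditional probabilities of a binary test from 2x2 counts."""
    diseased = true_pos + false_neg        # everyone who has the disease
    disease_free = false_pos + true_neg    # everyone who does not
    return {
        "sensitivity": true_pos / diseased,               # P(T+ | D+)
        "false_negative_rate": false_neg / diseased,      # P(T- | D+)
        "specificity": true_neg / disease_free,           # P(T- | D-)
        "false_positive_rate": false_pos / disease_free,  # P(T+ | D-)
    }

# Hypothetical counts (not from the study): 1000 with disease, 1000 without.
pap = detection_properties(true_pos=554, false_neg=446, false_pos=32, true_neg=968)
for name, value in pap.items():
    print(f"{name}: {value:.3f}")
```

Within each disease group the two probabilities sum to 1, which is why only the sensitivity and the specificity need to be reported.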
Summary: Sensitivity and Specificity

Term                             Notation
Disease present                  $D^+$
Disease absent                   $D^-$
Test positive                    $T^+$
Test negative                    $T^-$
Probability of false negative    $P(T^- \mid D^+)$
Sensitivity                      $P(T^+ \mid D^+)$
Probability of false positive    $P(T^+ \mid D^-)$
Specificity                      $P(T^- \mid D^-)$

6.2 Bayes’ Theorem

As previously noted, sensitivity and specificity are properties of the test itself. They give us no information about an individual patient to whom the test is applied. But now that we have considered the accuracy of a test among individuals who have the disease of interest and those who do not, we are ready to investigate the question that is of primary concern to both an individual being tested and the health care professional involved in the screening: What is the probability that a person with a positive test result actually has the disease?

Referring back to our previous example, what is the probability that a female with a Pap smear positive for cervical intraepithelial neoplasia does have such a cancer? Again, $D^+$ represents the event that an individual has cervical cancer, $D^-$ the event that she does not, and $T^+$ a positive Pap smear. We wish to compute $P(D^+ \mid T^+)$.

FIGURE 6.1 Venn diagram representing Bayes’ theorem

To begin, we use the formula for a conditional probability to write

$$P(D^+ \mid T^+) = \frac{P(D^+ \cap T^+)}{P(T^+)}.$$

Applying the multiplicative rule of probability to the numerator of the right-hand side of the equation, we have

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(T^+)}.$$

Furthermore, the total probability rule tells us that we can express the denominator as

$$P(T^+) = P(D^+ \cap T^+) + P(D^- \cap T^+) = P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-).$$

Putting this all together, we have

$$P(D^+ \mid T^+) = \frac{P(D^+ \cap T^+)}{P(T^+)} = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)}.$$

This rather daunting expression is known as Bayes’ theorem. The concept is illustrated in Figure 6.1.
Looking at the individual components of the formula, we know that $P(T^+ \mid D^+) = 0.554$ and $P(T^+ \mid D^-) = 0.032$; these are the sensitivity of the test and the probability of a false positive outcome, respectively. In order to apply the theorem, we still need to know $P(D^+)$ and $P(D^-)$. $P(D^+)$ is the probability that a female being screened for cervical cancer really does have the disease. It can also be interpreted as the proportion of females with cervical intraepithelial neoplasia in the population being screened, or the prevalence of disease. Suppose that in the population of interest, the prevalence of cervical cancer is 8 per 100,000 population. In this case,

$$P(D^+) = 0.00008.$$

$P(D^-)$ is the probability that a female does not have cervical cancer. Since $D^-$ is the complement of $D^+$,

$$P(D^-) = 1 - P(D^+) = 1 - 0.00008 = 0.99992.$$

Substituting these probabilities into Bayes’ theorem,

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)} = \frac{0.00008 \times 0.554}{(0.00008 \times 0.554) + (0.99992 \times 0.032)} = 0.001383.$$

$P(D^+ \mid T^+)$, the probability of disease given a positive test result, is called the predictive value of a positive test, or the positive predictive value. Here, it tells us that for every 1,000,000 positive Pap smears, 1,383 represent true cases of cervical cancer.

Bayes’ theorem is not restricted to situations in which individuals fall into one of two distinct subgroups. If $A_1, A_2, \ldots,$ and $A_n$ are $n$ mutually exclusive and exhaustive events such that

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n) = 1,$$

then Bayes’ theorem states that

$$P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{P(A_1)\,P(B \mid A_1) + \cdots + P(A_n)\,P(B \mid A_n)}$$

for each $i$, $1 \le i \le n$. Bayes’ theorem is valuable because it allows us to recalculate a probability based on some new information.
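The two-state form of Bayes’ theorem used for the Pap smear can be sketched in a few lines of Python. This is an illustration rather than part of the original text; the function name `positive_predictive_value` is an assumption, and the inputs are the prevalence, sensitivity, and specificity quoted above.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D+ | T+) from Bayes' theorem for a two-state disease model."""
    true_pos = prevalence * sensitivity                # P(D+) P(T+ | D+)
    false_pos = (1 - prevalence) * (1 - specificity)   # P(D-) P(T+ | D-)
    return true_pos / (true_pos + false_pos)

# Pap smear values quoted in the text.
ppv = positive_predictive_value(prevalence=0.00008, sensitivity=0.554, specificity=0.968)
print(f"positive predictive value = {ppv:.6f}")
```

Writing the numerator and the false positive term as separate variables mirrors the two ways a positive test can arise, which is the total probability rule in the denominator.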
In the population being screened for cervical cancer, we know that the probability a randomly selected female has the disease is just the prevalence of disease, $P(D^+) = 0.00008$. This is called the prior probability of disease. If we are then given an additional piece of information – the knowledge that a particular individual has tested positive on the Pap smear – our assessment of the probability changes. Using Bayes’ theorem, we found that $P(D^+ \mid T^+) = 0.001383$. This conditional probability is called the posterior probability of disease. Although it seems low – for every 1,000,000 positive Pap smears, only 1,383 represent true positive results – we have still obtained useful information. Once we are told that an individual has a positive Pap smear, the probability that she has cervical cancer increases more than 17-fold: 0.001383/0.00008 = 17.3. Having cervical cancer is a rare event, and the positive test makes that event 17.3 times more likely.

Just as we can calculate the predictive value of a positive test, Bayes’ theorem may also be used to calculate the predictive value of a negative test, or the negative predictive value. If $T^-$ represents the event of a negative test result, the negative predictive value – the probability of no disease given a negative test result – is equal to

$$P(D^- \mid T^-) = \frac{P(D^-)\,P(T^- \mid D^-)}{P(D^-)\,P(T^- \mid D^-) + P(D^+)\,P(T^- \mid D^+)} = \frac{0.99992 \times 0.968}{(0.99992 \times 0.968) + (0.00008 \times 0.446)} = 0.999963.$$

Therefore, for every 1,000,000 females with negative Pap smears, 999,963 do not have cervical cancer. Figure 6.2 illustrates the results of the entire screening test process for cervical cancer. Note that all numbers have been rounded to the nearest integer.

FIGURE 6.2 Performance of the Pap smear as a screening test for cervical cancer

As a second example of Bayes’ theorem, consider the use of the chest radiograph to screen for the presence of tuberculosis.
Among the 1875 subjects in a study evaluating the performance of the chest radiograph as a screening test, 1525 were known to suffer from tuberculosis and 350 did not. Chest X-rays were administered to all individuals; 1363 had a positive X-ray showing significant evidence of disease, and the other 512 had a negative X-ray. The data for this study are presented in the table below. What is the probability that a randomly selected individual has tuberculosis given that their X-ray is positive? Similarly, what is the probability that they do not have tuberculosis given that their X-ray is negative?

              Tuberculosis
X-ray         No      Yes     Total
Negative      178     334     512
Positive      172     1191    1363
Total         350     1525    1875

Let $D^+$ represent the event that an individual has tuberculosis and $D^-$ the event that they do not. These two events are mutually exclusive and exhaustive. In addition, $T^+$ represents a positive chest radiograph. First we must use the data collected in the study to calculate the sensitivity and specificity of the chest X-ray. Note that

$$P(T^+ \mid D^+) = \frac{1191}{1525} = 0.781,$$

and

$$P(T^- \mid D^-) = \frac{178}{350} = 0.509.$$

We now wish to find $P(D^+ \mid T^+)$, the probability that an individual who tests positive for tuberculosis actually has the disease. This is the positive predictive value of the X-ray. Bayes’ theorem tells us that

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)}.$$

We know the sensitivity of the chest radiograph, and the probability of a false positive result can be calculated as 1 minus the test’s specificity. We still need to know $P(D^+)$ and $P(D^-)$ in order to apply the theorem. $P(D^+)$ is the probability that an individual in the general population has tuberculosis, or the prevalence of disease. Since the 1875 individuals in the study described above were not chosen from the population at random, the prevalence of disease cannot be obtained from the information in the table.
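The sensitivity and specificity calculations above can be reproduced directly from the table of counts. The snippet below is a small sketch, not part of the original text; the dictionary layout and variable names are illustrative assumptions, while the counts are those in the study table.

```python
# Counts from the study table: keys are (X-ray result, tuberculosis status).
counts = {
    ("positive", "tb"): 1191,
    ("negative", "tb"): 334,
    ("positive", "no_tb"): 172,
    ("negative", "no_tb"): 178,
}

with_tb = counts[("positive", "tb")] + counts[("negative", "tb")]            # 1525
without_tb = counts[("positive", "no_tb")] + counts[("negative", "no_tb")]   # 350

sensitivity = counts[("positive", "tb")] / with_tb         # P(T+ | D+)
specificity = counts[("negative", "no_tb")] / without_tb   # P(T- | D-)

# Fraction of study subjects with tuberculosis. This is NOT a population
# prevalence, because the subjects were not sampled at random.
study_fraction = with_tb / (with_tb + without_tb)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"diseased fraction in study = {study_fraction:.3f}")
```

The last quantity shows why the table cannot supply $P(D^+)$: it reflects how the study was assembled, not how common tuberculosis is in the population.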
Note that it is extremely unlikely that the prevalence of tuberculosis in a population being tested would ever be as high as 1525/1875 = 0.813, or 81.3%. Suppose that the true prevalence of tuberculosis is 9.3 cases per 100,000 population, so that

$$P(D^+) = 0.000093.$$

$P(D^-)$ is the probability that an individual does not have tuberculosis. Since $D^-$ is the complement of $D^+$,

$$P(D^-) = 1 - P(D^+) = 1 - 0.000093 = 0.999907.$$

Using all this information, we can now calculate the probability that an individual suffers from tuberculosis given that they have a positive chest X-ray; this positive predictive value is

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)} = \frac{(0.000093)(0.781)}{(0.000093)(0.781) + (0.999907)(0.491)} = 0.000148.$$

For every 1,000,000 positive X-rays, only 148 represent true cases of tuberculosis. The negative predictive value of the chest X-ray is

$$P(D^- \mid T^-) = \frac{P(D^-)\,P(T^- \mid D^-)}{P(D^-)\,P(T^- \mid D^-) + P(D^+)\,P(T^- \mid D^+)} = \frac{(0.999907)(0.509)}{(0.999907)(0.509) + (0.000093)(0.219)} = 0.999960.$$

For every 1,000,000 negative X-rays, 999,960 individuals truly do not have the disease.

How would the positive and negative predictive values of the chest radiograph change if the prevalence of tuberculosis in the population being screened was twice as high? In this case,

$$P(D^+) = 0.000186$$

and

$$P(D^-) = 1 - P(D^+) = 0.999814.$$

Therefore, the positive predictive value of the chest X-ray increases,

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)} = \frac{(0.000186)(0.781)}{(0.000186)(0.781) + (0.999814)(0.491)} = 0.000296,$$

and the negative predictive value decreases,

$$P(D^- \mid T^-) = \frac{P(D^-)\,P(T^- \mid D^-)}{P(D^-)\,P(T^- \mid D^-) + P(D^+)\,P(T^- \mid D^+)} = \frac{(0.999814)(0.509)}{(0.999814)(0.509) + (0.000186)(0.219)} = 0.999920.$$

Summary: Bayes’ Theorem

Term                               Notation
Bayes’ theorem                     $P(A_i \mid B) = \dfrac{P(A_i)\,P(B \mid A_i)}{P(A_1)\,P(B \mid A_1) + \cdots + P(A_n)\,P(B \mid A_n)}$ for each $i$, $1 \le i \le n$, if $A_1, A_2, \ldots, A_n$ are $n$ mutually exclusive and exhaustive events such that $P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n) = 1$
Prevalence                         $P(D^+)$
Positive predictive value (PPV)    $P(D^+ \mid T^+)$
Negative predictive value (NPV)    $P(D^- \mid T^-)$

6.3 Likelihood Ratios

Returning to Bayes’ theorem, the formula for the posterior probability of disease given a positive test result is

$$P(D^+ \mid T^+) = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)}.$$

We see that this posterior probability depends on the sensitivity and specificity of the test as well as the prior probability of disease, the prevalence $P(D^+)$. For a diagnostic test with sensitivity 0.9 and specificity 0.7, Figure 6.3 displays the positive predictive value as the prevalence ranges from just above 0 to nearly 1. Once again we see that the positive predictive value increases as the prevalence increases. We also note that when the prevalence of disease is low, the chance that a positive test result truly reflects the condition for which we are testing is correspondingly low. If the condition is severe, such as tuberculosis, then a positive test result warrants further testing regardless of the magnitude of this probability. However, if the positive test result might lead to more harm than good, then this low positive predictive value raises the question of whether the screening test is even worth performing. This case is raised in a Harvard Medical School publication where nine medical specialty organizations were each asked to list the most unnecessary tests and services in their fields.
This list of “don’ts” included:

“Don’t perform stress cardiac imaging or advanced noninvasive imaging as part of routine follow-up in patients without symptoms of cardiovascular disease,”

“Don’t image for suspected pulmonary embolism (PE) without moderate or high pre-test probability,” and

“For patients on dialysis who have limited life expectancies, don’t perform routine cancer screening unless the patient has signs or symptoms of cancer.”

FIGURE 6.3 Positive predictive value of a diagnostic test with sensitivity 0.9 and specificity 0.7 as a function of the prevalence of disease

Returning to Figure 6.3, this graph also highlights the worrisome property that the probability that a person with a positive reading truly has the disease for which they are being tested depends on the prevalence of that disease. In other words, a positive test result in two different locations with different prevalences of disease will result in two different interpretations. The probability of disease for a single individual with a positive test depends on who else is being tested.

To overcome this predicament, we return to the concept of odds. The odds will allow us to make a relative statement about probabilities – not relative to other people being tested, but relative to the status of the individual themselves. Recall that the odds of an event are defined as $p/(1 - p)$, the probability of the event divided by 1 minus the probability of the event. Bayes’ theorem allows us to calculate the probability of disease given a positive test result, and 1 minus this probability is

$$1 - P(D^+ \mid T^+) = \frac{P(D^-)\,P(T^+ \mid D^-)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)}.$$

Therefore, using notation similar to that for a conditional probability, the odds of having the disease given a positive test are

$$O(D^+ \mid T^+) = \frac{P(D^+ \mid T^+)}{1 - P(D^+ \mid T^+)} = \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^-)\,P(T^+ \mid D^-)}.$$

Since $P(D^-) = 1 - P(D^+)$, we can also write

$$O(D^+ \mid T^+) = \frac{P(D^+)}{1 - P(D^+)} \cdot \frac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)} = O(D^+)\,\frac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)} = O(D^+)\,\frac{\text{sensitivity}}{1 - \text{specificity}}.$$

This formula relates the prior odds of disease, $O(D^+)$, to the posterior odds of disease given a positive test result, $O(D^+ \mid T^+)$. The term by which the prior odds of disease is multiplied, sensitivity/(1 − specificity), is called the positive likelihood ratio. The positive likelihood ratio of a test increases as its sensitivity or specificity or both increase. It quantifies the informative benefit of a positive test. A positive likelihood ratio less than 1 means that the odds of having a particular condition are lower after a positive test than they were before the test, which is the opposite of what we want to happen. If the positive likelihood ratio is equal to 1, the test neither increases nor decreases the odds of the condition, meaning that the test does not give us any information at all. A positive likelihood ratio greater than 1 means that the odds of the condition are higher after a positive test; the larger the positive likelihood ratio, the greater the impact of a positive result.

As an example, a test with sensitivity 0.9 and specificity 0.7 will have a positive likelihood ratio of 0.9/(1 − 0.7) = 3. This indicates that if an individual tests positive, the odds of having the condition triple as a result of the test. When the Pap smear is used as a screening test for cervical cancer, the positive likelihood ratio is 0.554/(1 − 0.968) = 17.3. For the chest radiograph screening for tuberculosis, it is only 0.781/(1 − 0.509) = 1.59. This information provides a different perspective on the importance of these screening tests, independent of the prevalence of disease.
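The odds form of Bayes’ theorem lends itself to a short numerical check. The sketch below is illustrative and not part of the original text; the helper names `odds`, `probability`, and `positive_likelihood_ratio` are assumptions. It computes the positive likelihood ratio for the three tests discussed and then updates the prior odds of cervical cancer after a positive Pap smear.

```python
def odds(p):
    """Convert a probability to odds, p / (1 - p)."""
    return p / (1 - p)

def probability(o):
    """Convert odds back to a probability, o / (1 + o)."""
    return o / (1 + o)

def positive_likelihood_ratio(sensitivity, specificity):
    """sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

# (sensitivity, specificity) pairs quoted in the text.
tests = {
    "generic test": (0.9, 0.7),
    "Pap smear": (0.554, 0.968),
    "chest radiograph": (0.781, 0.509),
}
for name, (sens, spec) in tests.items():
    print(f"{name}: LR+ = {positive_likelihood_ratio(sens, spec):.2f}")

# Update the prior odds of cervical cancer after a positive Pap smear.
lr_pos = positive_likelihood_ratio(0.554, 0.968)
prior_odds = odds(0.00008)              # O(D+), prevalence of 8 per 100,000
posterior_odds = prior_odds * lr_pos    # O(D+ | T+)
posterior_prob = probability(posterior_odds)
print(f"posterior probability = {posterior_prob:.6f}")
```

Converting the posterior odds back to a probability recovers the positive predictive value obtained earlier from Bayes’ theorem, which confirms that the two formulations carry the same information.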
As we have seen, Bayes’ theorem can also be used to calculate the probability of no disease given a negative test result, and from this we can find the odds of not having disease given a negative test:

$$O(D^- \mid T^-) = \frac{P(D^- \mid T^-)}{1 - P(D^- \mid T^-)} = \frac{P(D^-)\,P(T^- \mid D^-)}{P(D^+)\,P(T^- \mid D^+)} = \frac{P(D^-)}{1 - P(D^-)} \cdot \frac{P(T^- \mid D^-)}{P(T^- \mid D^+)} = O(D^-)\,\frac{\text{specificity}}{1 - \text{sensitivity}}.$$

The term specificity/(1 − sensitivity) is called the negative likelihood ratio, and quantifies the change in the odds of not having the disease after a negative test result. As with the positive likelihood ratio, the negative likelihood ratio of a test increases as its specificity or sensitivity or both increase.

Returning to our examples, for a test with sensitivity 0.9 and specificity 0.7 the negative likelihood ratio is 0.7/(1 − 0.9) = 7. This indicates that if an individual tests negative, the odds of not having the condition increase by a factor of 7. For the Pap smear the negative likelihood ratio is 0.968/(1 − 0.554) = 2.17, and for the chest radiograph it is 0.509/(1 − 0.781) = 2.32.

The value of the likelihood ratios is that neither the positive likelihood ratio nor the negative likelihood ratio depends on the prevalence of disease. They are properties of the test itself. Likelihood ratios provide the connection between the posterior odds of disease (or no disease), and the odds prior to testing. The most impactful tests are those with a high positive likelihood ratio and, at the same time, a high negative likelihood ratio. However, an increase in sensitivity often comes at the cost of a decrease in specificity, as we shall see in the next section. Usually we need an advance in technology – a new test – to increase both simultaneously.
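As a hedged illustration of the negative likelihood ratio as defined in this section, the snippet below updates the prior odds of not having cervical cancer after a negative Pap smear. The function and variable names are assumptions introduced for this sketch. Note that some other references define the negative likelihood ratio as the reciprocal, (1 − sensitivity)/specificity; the code follows the definition given in the text.

```python
def negative_likelihood_ratio(sensitivity, specificity):
    """specificity / (1 - sensitivity), the definition used in this chapter."""
    return specificity / (1 - sensitivity)

prevalence = 0.00008                                     # P(D+) for the Pap smear example
prior_odds_no_disease = (1 - prevalence) / prevalence    # O(D-)

lr_neg = negative_likelihood_ratio(sensitivity=0.554, specificity=0.968)
posterior_odds_no_disease = prior_odds_no_disease * lr_neg   # O(D- | T-)

# Convert the posterior odds back to a probability: the negative predictive value.
npv = posterior_odds_no_disease / (1 + posterior_odds_no_disease)

print(f"LR- = {lr_neg:.2f}")
print(f"negative predictive value = {npv:.6f}")
```

The prior odds of no disease are already very large when the condition is rare, so even a modest negative likelihood ratio leaves the negative predictive value very close to 1.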
Summary: Odds and Likelihood Ratios

Term                                     Notation
Odds of disease given positive test      $O(D^+ \mid T^+) = \dfrac{P(D^+)\,P(T^+ \mid D^+)}{P(D^-)\,P(T^+ \mid D^-)}$
Odds of no disease given negative test   $O(D^- \mid T^-) = \dfrac{P(D^-)\,P(T^- \mid D^-)}{P(D^+)\,P(T^- \mid D^+)}$
Positive likelihood ratio                $\dfrac{\text{sensitivity}}{1 - \text{specificity}} = \dfrac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)}$
Negative likelihood ratio                $\dfrac{\text{specificity}}{1 - \text{sensitivity}} = \dfrac{P(T^- \mid D^-)}{P(T^- \mid D^+)}$

6.4 ROC Curves

As we have seen, diagnosis is an imperfect process. A positive test result does not guarantee that the person being tested has disease. In theory, it is desirable to have a test that is both highly sensitive and highly specific. In reality, however, such a procedure is often not possible. Many tests are based on a continuous clinical measurement which can assume a range of values. A cutoff is set, and values on one side of the cutoff are called positive test results while those on the other side are called negative results. Shifting the cutoff so that either more or fewer results are labeled positive changes the sensitivity and specificity of the test. As we will see, however, there is a trade-off; as one increases, the other decreases.

Consider Table 6.1, which displays data from a kidney transplant program in which renal allografts were performed. The level of serum creatinine – a chemical compound found in the blood and measured in milligrams percent – was used as a diagnostic tool for detecting potential transplant rejection. An increased creatinine level is often associated with subsequent organ failure. If we use a serum creatinine level greater than 2.9 mg % as an indicator of imminent rejection, the diagnostic test has sensitivity 0.303 and specificity 0.909. To increase the sensitivity, we could lower the arbitrary cutoff that distinguishes a positive test result from a negative one. If we use 1.2 mg %, for example, a much greater proportion of the results would be designated positive.
In this case, we would rarely fail to identify a patient who will reject the organ. At the same time, we would increase the probability of a false positive result, thereby decreasing the specificity. Conversely, by raising the cutoff to increase the specificity, we would hardly ever misclassify a person who is not going to reject the organ, but would in turn decrease the sensitivity. Likelihood ratios behave in a similar manner. Both measures change as the cutoff of serum creatinine is shifted; as the positive likelihood ratio increases, the negative likelihood ratio generally decreases.

TABLE 6.1
Sensitivity, specificity, and positive and negative likelihood ratios of serum creatinine level for predicting transplant rejection

Serum                                      Positive      Negative
Creatinine   Sensitivity   Specificity     Likelihood    Likelihood
(mg %)                                     Ratio         Ratio
1.2          0.939         0.123           1.071         2.016
1.3          0.939         0.203           1.178         3.328
1.4          0.909         0.281           1.264         3.088
1.5          0.818         0.380           1.319         2.088
1.6          0.758         0.461           1.406         1.905
1.7          0.727         0.535           1.563         1.960
1.8          0.636         0.649           1.812         1.783
1.9          0.636         0.711           2.201         1.953
2.0          0.545         0.766           2.329         1.684
2.1          0.485         0.773           2.137         1.501
2.2          0.485         0.803           2.462         1.559
2.3          0.394         0.811           2.085         1.338
2.4          0.394         0.843           2.510         1.391
2.5          0.364         0.870           2.800         1.368
2.6          0.333         0.891           3.055         1.336
2.7          0.333         0.894           3.142         1.340
2.8          0.333         0.896           3.202         1.343
2.9          0.303         0.909           3.330         1.304

The relationship between sensitivity and specificity can be illustrated using a graph known as a receiver-operating characteristic (ROC) curve. An ROC curve is a line graph that plots the probability of a true positive result – the sensitivity of the diagnostic test – against the probability of a false positive result for a range of different cutoff points. These graphs were first used in the field of communications. As an example, Figure 6.4 displays an ROC curve for the data contained in Table 6.1.
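To make the construction of the ROC curve concrete, the following sketch turns each row of Table 6.1 into one point on the curve, with the false positive probability (1 − specificity) on the horizontal axis and the sensitivity on the vertical axis. It also recomputes both likelihood ratios from the rounded sensitivity and specificity. The variable names are illustrative assumptions, and no plotting library is used so that the snippet stays self-contained.

```python
# (cutoff in mg %, sensitivity, specificity) transcribed from Table 6.1
table_6_1 = [
    (1.2, 0.939, 0.123), (1.3, 0.939, 0.203), (1.4, 0.909, 0.281),
    (1.5, 0.818, 0.380), (1.6, 0.758, 0.461), (1.7, 0.727, 0.535),
    (1.8, 0.636, 0.649), (1.9, 0.636, 0.711), (2.0, 0.545, 0.766),
    (2.1, 0.485, 0.773), (2.2, 0.485, 0.803), (2.3, 0.394, 0.811),
    (2.4, 0.394, 0.843), (2.5, 0.364, 0.870), (2.6, 0.333, 0.891),
    (2.7, 0.333, 0.894), (2.8, 0.333, 0.896), (2.9, 0.303, 0.909),
]

# Each cutoff contributes one ROC point: x = P(false positive), y = sensitivity.
roc_points = [(round(1 - spec, 3), sens) for _, sens, spec in table_6_1]

for (cutoff, sens, spec), (fpr, tpr) in zip(table_6_1, roc_points):
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = spec / (1 - sens)   # negative likelihood ratio, as defined in the text
    print(f"{cutoff:.1f}  FPR={fpr:.3f}  TPR={tpr:.3f}  LR+={lr_pos:.3f}  LR-={lr_neg:.3f}")
```

Reading the output from top to bottom traces the curve from its upper right toward its lower left: raising the cutoff lowers both the false positive probability and the sensitivity.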
When an existing test is being evaluated, this type of graph may be used to help assess the usefulness of the test and to determine the most appropriate cutoff point. The dashed line in Figure 6.4 corresponds to a test that gives positive and negative results by chance alone; for example, the test result is determined by flipping a coin. Such a test has no inherent value. The closer the curve lies to the upper left-hand corner of the graph, the more accurate the test. Furthermore, the point that lies closest to this upper corner is usually chosen as the cutoff, since it comes closest to maximizing sensitivity and specificity simultaneously.

Of course we all want a perfect test, but we must also be realistic. The judgment about which property to emphasize – sensitivity or specificity – is influenced by the reason for testing. When screening, we do not want to miss anyone who has the condition, and thus place more emphasis on sensitivity. If the individuals with positive results are then retested, we wish to retain a high sensitivity, but also desire a high specificity. This may involve using a more expensive test in the second round.

FIGURE 6.4 ROC curve for serum creatinine level as a test for transplant rejection

6.5 Calculation of Prevalence

In addition to being used to determine positive and negative predictive values, diagnostic testing results can also be used to calculate the prevalence of disease in a population where the prevalence is not known. This information might then be used to help determine a treatment strategy. For example, to eradicate schistosomiasis – a disease caused by parasitic flatworms – the WHO recommends that first a community be classified as high-risk, medium-risk, or low-risk on the basis of prevalence of the disease; they then prescribe treatment according to the community’s status.
As another example of the importance of estimating prevalence, the New York State Department of Health initiated a program to screen all infants born over a 28-month period for the human immunodeficiency virus (HIV). Since maternal antibodies cross the placenta, the presence of antibodies in an infant signals infection in the mother. Because the tests were performed anonymously, however, no verification of the results was possible.

The reported outcomes of the statewide screening are presented in Table 6.2. Let $n^+$ represent the number of newborns who test positive and $n$ the total number of infants screened. In each region of New York, HIV seroprevalence – or $P(H)$, where $H$ is the event that a mother is infected with the virus – is calculated as $n^+/n$. In Manhattan, for example, 50,364 infants were tested, and 799 of the results were positive. In this borough, therefore,

$$\frac{n^+}{n} = \frac{799}{50{,}364} = 0.0159.$$

In the upstate urban region of New York State,

$$\frac{n^+}{n} = \frac{119}{88{,}088} = 0.0014.$$

TABLE 6.2
Percentage of HIV positive newborns by region for the state of New York, December 1987–March 1990

                                     Number      Total       Percent
Region                               Positive    Tested      Positive
New York State exclusive of NYC      601         346,522     0.17
  NYC suburban                       329         120,422     0.27
  Mid-Hudson Valley                  71          29,450      0.24
  Upstate urban                      119         88,088      0.14
  Upstate rural                      82          108,562     0.08
New York City                        3650        294,062     1.24
  Manhattan                          799         50,364      1.59
  Bronx                              998         58,003      1.72
  Brooklyn                           1352        104,613     1.29
  Queens                             424         67,474      0.63
  Staten Island                      77          13,608      0.57

There is a problem here, however. The quantity $n^+/n$ actually represents $P(T^+)$, the probability of a positive test result. If the screening test were perfect, then $P(H)$ and $P(T^+)$ would be identical. The test is not infallible, however; both false positive and false negative results are possible.
In fact, applying the total probability and multiplicative rules, the true probability of a positive test is

$$\begin{aligned}
P(T^+) &= P(T^+ \cap H) + P(T^+ \cap H^C) \\
&= P(T^+ \mid H)\,P(H) + P(T^+ \mid H^C)\,P(H^C) \\
&= P(T^+ \mid H)\,P(H) + [1 - P(T^- \mid H^C)]\,[1 - P(H)].
\end{aligned}$$

Note that a positive test result can occur in two different ways: either the mother is infected with HIV, or she is not. In addition to the prevalence of infection, this equation incorporates both the sensitivity and specificity of the screening test.

If $n^+/n$ is the probability of a positive test result, then how do we compute the prevalence of HIV? Using the expression for $P(T^+)$, we are able to solve for the true quantity of interest. After some algebraic manipulation, we find that

$$\begin{aligned}
P(H) &= \frac{P(T^+) - P(T^+ \mid H^C)}{P(T^+ \mid H) - P(T^+ \mid H^C)} \\
&= \frac{(n^+/n) - P(T^+ \mid H^C)}{P(T^+ \mid H) - P(T^+ \mid H^C)}.
\end{aligned}$$

Since the prevalence of HIV infection is also a probability, its value must lie between 0 and 1. Considering the expression for $P(H)$ above, for any screening test of value,

$$P(T^+ \mid H) > P(T^+ \mid H^C).$$

In other words, the probability of a positive test result among individuals who are infected with HIV is higher than the probability among individuals who are not infected. As a result, the denominator of the ratio above is positive. For $P(H)$ to be greater than 0, the numerator is required to be positive as well. Consequently, we must have

$$\frac{n^+}{n} > P(T^+ \mid H^C) = 1 - P(T^- \mid H^C).$$

The proportion of positive test results in the entire population must be greater than the proportion of positive results among those who are not infected with HIV. Note that the specificity of the screening test plays a critical role in the calculation of prevalence; if the prevalence is very low, the disease may not be detected by a test with inadequate specificity.

Return to the data in Table 6.2.
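
As a rough illustration of the constraint above, the following Python sketch uses counts from Table 6.2 to ask how specific a test would need to be, region by region, before the corrected prevalence can come out positive. It also implements the expression for P(H). The function names are illustrative, and the inputs to the final call are hypothetical values chosen only to exercise the formula.

```python
# Counts (number positive, total tested) copied from Table 6.2.
TABLE_6_2 = {
    "Manhattan": (799, 50_364),
    "Upstate urban": (119, 88_088),
    "Upstate rural": (82, 108_562),
}


def minimum_specificity(n_positive, n_tested):
    """Specificity the test must exceed for n+/n > 1 - specificity to hold."""
    return 1 - n_positive / n_tested


def corrected_prevalence(p_positive, sensitivity, specificity):
    """P(H) = [P(T+) - P(T+ | H^C)] / [P(T+ | H) - P(T+ | H^C)]."""
    false_positive_rate = 1 - specificity  # P(T+ | H^C)
    if sensitivity <= false_positive_rate:
        raise ValueError("a test of value needs sensitivity > 1 - specificity")
    return (p_positive - false_positive_rate) / (sensitivity - false_positive_rate)


for region, (n_positive, n_tested) in TABLE_6_2.items():
    bound = minimum_specificity(n_positive, n_tested)
    print(f"{region}: specificity must exceed {bound:.4f}")

# Hypothetical inputs (not from the study): 5% positive tests,
# sensitivity 0.90, specificity 0.97.
print(round(corrected_prevalence(0.05, 0.90, 0.97), 4))
```

In regions where very few newborns test positive, the required specificity sits very close to 1, so even a small false positive rate can swamp the signal.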
We do not know the sensitivity and specificity of the diagnostic procedure used, although we can be sure that the test was not perfect. Suppose, however, that the sensitivity of the screening test is 0.99 while its specificity is 0.998; these numbers represent the higher end of the range of possible values. Also, recall that the probability of a positive test result in Manhattan is 0.0159. As a result, the prevalence of HIV infection in this borough would be calculated as

$$P(H) = \frac{0.0159 - (1 - 0.998)}{0.99 - (1 - 0.998)} = 0.0141,$$

which is lower than the probability of a positive test result. For the upstate urban region of New York,

$$P(H) = \frac{0.0014 - (1 - 0.998)}{0.99 - (1 - 0.998)} = -0.0006.$$

Even with a specificity as high as 0.998, the prevalence is calculated to be negative. Obviously, this is a nonsensical result, which most likely occurred because the testing procedure was not accurate enough to measure the very low prevalence of HIV in this region.

6.6 Varying Sensitivity

We have been discussing sensitivity and specificity as if they are fixed properties of a screening test, but this is really an over-simplification. Some tests perform differently in different groups of people. For example, consider a pregnancy test that measures a woman’s level of hCG (human chorionic gonadotropin), a hormone found in urine. A normal pre-pregnancy level is less than 5 mIU/mL. Once a woman conceives, her level of hCG begins to rise. A level above 25 mIU/mL is considered a sign of pregnancy. Of course, the point at which pregnancy becomes detectable varies from pregnancy to pregnancy and from one test kit to another; the sensitivity of the test changes from day to day. Figure 6.5 depicts a summary of this phenomenon.

More generally, consider a population that can be divided into two mutually exclusive groups; the event that a person is in the first group is represented by $B_1$, and the event that they are in the second group by $B_2$.
Assume that a screening test of interest has a different sensitivity in each of the two groups. We represent these sensitivities by $S_1$ and $S_2$, respectively. If we define $T^+$ to be the event that a person tests positive, then $S_1 = P(T^+ \mid B_1)$. Similarly, $S_2 = P(T^+ \mid B_2)$.

Returning to the pregnancy test example, suppose that $B_1$ and $B_2$ represent two distinct groups within the population of pregnant women. $B_1$ is the event that a pregnant woman is using the test 4 or more days before the expected day of the missed period, and $B_2$ the event that she is using the test 3 or fewer days before the expected missed period. $B_1$ and $B_2$ are complements. According to Figure 6.5, we expect that $S_1$ is smaller than $S_2$. Further suppose that in the total population of women using the pregnancy test, a proportion $w_0$ comes from $B_1$, while $1 - w_0$ comes from $B_2$.

FIGURE 6.5 Depiction of the varying sensitivities of a pregnancy test, depending on when it is used

Then, for the population using the test, the sensitivity $S$ can be calculated using the total probability rule:

$$\begin{aligned}
S &= P(T^+ \cap B_1) + P(T^+ \cap B_2) \\
&= P(B_1)\,P(T^+ \mid B_1) + P(B_2)\,P(T^+ \mid B_2) \\
&= w_0 S_1 + (1 - w_0) S_2.
\end{aligned}$$

If we are interested in accurately estimating the sensitivity for an intended market, then we must use a sample of women that reflects that market. Let $w$ denote the proportion of the sample that comes from $B_1$. Ideally, we want $w$ to equal $w_0$. If it does not, then one of two things can happen:

1. If $w > w_0$, the sample over-represents women in their very early pregnancies. This results in a lower sensitivity than for smaller $w$, and the product would not appear to perform as well as it should.
2. If $w < w_0$, the sample under-represents women in their early pregnancies, presumably those most interested in using the test. The sample would result in a higher sensitivity that makes the product look better than it should.

While we have focused on sensitivity in this section, a parallel argument can be made for specificity.
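
The mixing formula S = w0 S1 + (1 - w0) S2 can be explored numerically. The Python sketch below is illustrative only: the group sensitivities (0.60 and 0.95) and the market proportion w0 = 0.30 are hypothetical numbers, not values read from Figure 6.5, and the function name is an assumption.

```python
def pooled_sensitivity(w, s1, s2):
    """Total probability rule: S = w * S1 + (1 - w) * S2."""
    if not 0 <= w <= 1:
        raise ValueError("w must be a proportion between 0 and 1")
    return w * s1 + (1 - w) * s2


# Hypothetical values for illustration only.
S1, S2 = 0.60, 0.95  # sensitivity in B1 (early testers) and B2 (later testers)
W0 = 0.30            # proportion of the intended market that falls in B1

target = pooled_sensitivity(W0, S1, S2)
for w in (0.10, 0.30, 0.50):
    s = pooled_sensitivity(w, S1, S2)
    print(f"w = {w:.2f}: S = {s:.3f} (difference from market: {s - target:+.3f})")
```

Because S1 is smaller than S2 here, a sample with w above w0 understates the sensitivity and a sample with w below w0 overstates it, matching the two cases listed above.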

The phenomenon described is called spectrum bias. Spectrum bias occurs when the spectrum of users being tested does not reflect the spectrum of future or intended users. To quote the first sentence of an article by Ransohoff and Feinstein published in The New England Journal of Medicine, “Clinical investigations of the efficacy of diagnostic tests have often produced misleading results so that tests initially regarded as valuable were later rejected as worthless.”

6.7 Further Applications

If $A_1$ and $A_2$ are mutually exclusive and exhaustive events such that

$$P(A_1 \cup A_2) = P(A_1) + P(A_2) = 1,$$

then Bayes’ theorem states that

$$P(A_1 \mid B) = \frac{P(A_1)\,P(B \mid A_1)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2)}.$$

Bayes’ theorem is particularly important in diagnostic testing and screening. It relates the positive and negative predictive values of a test to its sensitivity and specificity, as well as the prevalence of disease in the population being tested.

We previously examined the Pap smear as a screening test for cervical intraepithelial neoplasia. The same study which evaluated the performance of the Pap smear also assessed the properties of the HPV test – which screens for the DNA of oncogenic human papillomaviruses – to identify the presence of cervical cancer. Let $D^+$ be the event that a female has cervical cancer, $D^-$ the event that she does not, and $T^+$ a positive HPV test. The study found that the sensitivity of the HPV test is 94.6%; therefore, $P(T^+ \mid D^+) = 0.946$. This is much higher than the sensitivity for the Pap smear. Consequently, the probability of a false negative result is lower, $P(T^- \mid D^+) = 1 - 0.946 = 0.054$ rather than 0.446. The specificity of the HPV test is 94.1%, $P(T^- \mid D^-) = 0.941$, which is comparable to that for the Pap smear. The probability of a false positive is $P(T^+ \mid D^-) = 1 - 0.941 = 0.059$.

What is the probability that a female with a positive HPV test result has cervical cancer?
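
The answer depends on the prevalence in the population being screened. As a minimal sketch, Bayes’ theorem for the two predictive values can be written as small Python functions. The function names and the list of prevalences below are hypothetical choices for illustration; the sensitivity and specificity are the HPV test values quoted above.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D+ | T+) by Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prevalence, sensitivity, specificity):
    """P(D- | T-) by Bayes' theorem."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


SENSITIVITY, SPECIFICITY = 0.946, 0.941  # HPV test values from the text

# Hypothetical prevalences, chosen only to show the trend.
for prevalence in (0.0001, 0.001, 0.01, 0.1):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    npv = negative_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence {prevalence:g}: PPV = {ppv:.4f}, NPV = {npv:.6f}")
```

The same sensitivity and specificity give very different positive predictive values as the prevalence changes, which is why the population being screened has to be specified before the question can be answered.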
Suppose this test is being used to screen the population previously described, with prevalence of cervical intraepithelial neoplasia $P(D^+) = 0.00008$. Using Bayes’ theorem, the predictive value of a positive test is

$$\begin{aligned}
P(D^+ \mid T^+) &= \frac{P(D^+)\,P(T^+ \mid D^+)}{P(D^+)\,P(T^+ \mid D^+) + P(D^-)\,P(T^+ \mid D^-)} \\
&= \frac{(0.00008)(0.946)}{(0.00008)(0.946) + (0.99992)(0.059)} \\
&= 0.001281.
\end{aligned}$$

Therefore, a positive HPV test increases the probability that a female in this population has cervical cancer from 0.00008 to 0.001281.

What is the probability that a female does not have cervical cancer given that the HPV test is negative? Again applying Bayes’ theorem, the predictive value of a negative test is

$$\begin{aligned}
P(D^- \mid T^-) &= \frac{P(D^-)\,P(T^- \mid D^-)}{P(D^-)\,P(T^- \mid D^-) + P(D^+)\,P(T^- \mid D^+)} \\
&= \frac{(0.99992)(0.941)}{(0.99992)(0.941) + (0.00008)(0.054)} \\
&= 0.99999.
\end{aligned}$$

A negative test result increases the probability that a female does not have cervical cancer from 0.99992 to 0.99999. Figure 6.6 illustrates the results of the screening test process.

FIGURE 6.6 Performance of the HPV test as a screening test for cervical cancer

The HPV test has a positive likelihood ratio of

$$\frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{0.946}{1 - 0.941} = 16.0,$$

and a negative likelihood ratio of

$$\frac{\text{specificity}}{1 - \text{sensitivity}} = \frac{0.941}{1 - 0.946} = 17.4.$$

Recall that for the Pap smear, the positive and negative likelihood ratios were 17.3 and 2.17, respectively. Due to its higher sensitivity, a negative HPV test provides more information than a negative Pap smear. The information provided by positive test results is fairly similar, however.

Many diagnostic and screening tests are based on continuous clinical measurements that can assume a range of values. Choice of a cutoff for distinguishing positive versus negative tests demonstrates a trade-off between sensitivity and specificity. For example, a study of 300 Iranian women used body mass index (BMI) as a screening test for identifying breast cancer.
Various cutoff values of BMI – where any value above that cutoff was taken to be a positive test result – and their corresponding sensitivities and specificities are given in Table 6.3. The ROC curve is shown in Figure 6.7. Note that as the cutoff value for BMI goes up, its specificity increases but its sensitivity decreases.

TABLE 6.3 Sensitivity and specificity of BMI for predicting breast cancer

| BMI (kg/m²) | Sensitivity | Specificity |
|---|---|---|
| 18 | 1.000 | 0.000 |
| 20 | 1.000 | 0.010 |
| 22 | 0.990 | 0.115 |
| 24 | 0.950 | 0.415 |
| 26 | 0.850 | 0.600 |
| 28 | 0.660 | 0.735 |
| 30 | 0.470 | 0.865 |
| 32 | 0.340 | 0.915 |
| 34 | 0.210 | 0.930 |
| 36 | 0.170 | 0.970 |
| 38 | 0.070 | 0.980 |
| 40 | 0.010 | 0.995 |

FIGURE 6.7 ROC curve for BMI as a test for breast cancer

The goal of newborn screening (NBS) is the detection of infants who are pre-symptomatic, but who have an increased risk of congenital conditions for which early treatment could prevent intellectual and physical disability, even early death. The procedure includes testing a drop of blood – the heel prick sample – and a hearing screen. Although the program started by testing for a single condition, the metabolic disorder phenylketonuria (PKU), the number of conditions and tests continues to grow. The disorders included in newborn screening vary across countries and locales, but most include PKU, cystic fibrosis, sickle cell disease, critical congenital heart disease, and hearing loss.

The majority of these conditions are very rare. As we have seen, unless a test has 100% specificity, a very low prevalence of disease will lead to a relatively large proportion of false positive results. To quantify the situation in NBS programs in the United States, a study reported on tests for three hereditary metabolic disorders and two congenital endocrinopathies. The results of this study are shown in Table 6.4.

TABLE 6.4 Statistics associated with newborn screening for five conditions

| Result | PKU¹ | GALT² | BTD³ | CH⁴ | CAH⁵ |
|---|---|---|---|---|---|
| Number of positive tests | 9156 | 10,210 | 321 | 63,035 | 9410 |
| Confirmed cases | 289 | 54 | 19 | 1203 | 51 |
| Assumed sensitivity (%) | 100 | 100 | 100 | 100 | 100 |
| Specificity (%) | 99.8 | 99.7 | 99.98 | 98.5 | 99.2 |
| Positive predictive value (%) | 3.16 | 0.53 | 5.90 | 1.91 | 0.54 |
| Prevalence: 1 in | 13,050 | 62,800 | 67,000 | 3300 | 25,100 |
| Positive likelihood ratio | 499 | 333 | 5000 | 67 | 125 |
| Theoretical Confirmation of Positive Results | | | | | |
| Prevalence | 0.0316 | 0.0053 | 0.0592 | 0.0191 | 0.0054 |
| Positive predictive value (%) | 94.2 | 63.9 | 99.7 | 56.5 | 40.5 |

¹ PKU phenylketonuria, ² GALT galactosemia, ³ BTD biotinidase deficiency, ⁴ CH congenital hypothyroidism, ⁵ CAH congenital adrenal hyperplasia

We added the row showing the positive likelihood ratio; this row shows the discovery value of such tests. For example, consider the PKU screen. The prevalence of this condition in the general population of newborns is 1 in 13,050. Since this probability is so small, it is approximately equal to the odds of having PKU. The positive likelihood ratio is 499, so the odds of having PKU among those testing positive is $499 \times (1/13{,}050)$, and the probability is approximately 1 in 27.

The negative likelihood ratio is not included in the table because the sensitivity of the test is assumed to be 1, or 100%. Technically, this means the negative likelihood ratio is infinite; there are no false negatives. To ensure that this is a good approximation to reality, the cutoffs are set very high. Unfortunately this results in a large number of false positive results. After proper risk communication with the birth families, the false positives can presumably be corrected by further diagnostic testing.

At the bottom of Table 6.4 we have added two rows with some additional information. Consider the question: What if we had another, independent test with the same sensitivity and specificity as the one used to create this table, but this time we applied the test not to the general population of infants born, but instead to those infants who had already tested positive the first time?
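
One way to explore this question numerically is to treat the proportion of confirmed cases among first-round positives as the prevalence for the second round. The Python sketch below does this with the counts and specificities from Table 6.4, assuming a sensitivity of 100% (as the table does) and assuming that the second test errs independently of the first. The function and variable names are illustrative.

```python
# (positive tests, confirmed cases, specificity) copied from Table 6.4.
TABLE_6_4 = {
    "PKU": (9156, 289, 0.998),
    "GALT": (10_210, 54, 0.997),
    "BTD": (321, 19, 0.9998),
    "CH": (63_035, 1203, 0.985),
    "CAH": (9410, 51, 0.992),
}


def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D+ | T+) by Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def second_round_ppv(n_positive, n_confirmed, specificity, sensitivity=1.0):
    """PPV of a repeat test applied only to first-round positives."""
    prevalence = n_confirmed / n_positive  # PPV of the first round
    return positive_predictive_value(prevalence, sensitivity, specificity)


for condition, (n_positive, n_confirmed, specificity) in TABLE_6_4.items():
    ppv = second_round_ppv(n_positive, n_confirmed, specificity)
    print(f"{condition}: second-round PPV = {100 * ppv:.1f}%")
```

Under these assumptions, the printed values should reproduce the last row of Table 6.4 to one decimal place.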
For instance, with the PKU column, we take the 9156 samples that tested positive – 289 of whom had PKU – and subject them to a second test with sensitivity 100% and specificity 99.8%. The result is that the positive predictive value of the test would increase to 94.2%.

6.8 Review Exercises

1. What is the value of Bayes’ theorem? How is it applied in diagnostic testing?
2. What would happen to the specificity of a screening or diagnostic test if you were to try to increase its sensitivity?
3. What is an advantage of reporting the positive likelihood ratio of a screening test rather than the positive predictive value?
4. One study has reported that the sensitivity of the mammogram as a screening test for detecting breast cancer is 0.869 while its specificity is 0.889.
   (a) What is the probability of a false negative test result?
   (b) What is the probability of a false positive test result?
   (c) In a population where the prevalence of breast cancer is 0.0025, what is the probability that a female has breast cancer given that her mammogram is positive?
   (d) Now suppose that the prevalence of breast cancer in the population being screened is 0.025. How does the probability that a female has cancer given that her mammogram is positive change?
   (e) In a population where the prevalence of breast cancer is 0.0025, what is the probability that a female does not have cancer given that her mammogram is negative?
   (f) In this study, what is the positive likelihood ratio of the mammogram?
   (g) What is the negative likelihood ratio of the mammogram?
5. The National Institute for Occupational Safety and Health has developed a case definition of carpal tunnel syndrome – an affliction of the wrist – that incorporates three criteria: symptoms of nerve involvement, a history of occupational risk factors, and the presence of physical exam findings.
   The sensitivity of this definition as a diagnostic test for carpal tunnel syndrome is 0.67, and its specificity is 0.58.
   (a) In a population where the prevalence of carpal tunnel syndrome is 15%, what is the predictive value of a positive test result?
   (b) How does this predictive value change if the prevalence is only 10%? If it is 5%?
   (c) Construct a diagram – like the one in Figure 6.3 – illustrating the results of the diagnostic testing process. Assume that you start with a population of 1,000,000 people, and that the prevalence of carpal tunnel syndrome is 15%.
   (d) What are the positive and negative likelihood ratios for the case definition?
6. The following data are taken from a study investigating the use of a technique called radionuclide ventriculography as a diagnostic test for detecting coronary artery disease.

| Test | Disease Present | Disease Absent | Total |
|---|---|---|---|
| Positive | 302 | 80 | 382 |
| Negative | 179 | 372 | 551 |
| Total | 481 | 452 | 933 |

   (a) What is the sensitivity of radionuclide ventriculography in this study? What is its specificity?
   (b) For a population in which the prevalence of coronary artery disease is 0.10, calculate the probability that an individual has the disease given that they test positive using radionuclide ventriculography.
   (c) What is the predictive value of a negative test?
7. When screening for prostate cancer, many physicians use a prostate-specific antigen (PSA) level ≥ 4.1 ng/ml as a positive test result. Using this PSA cutoff, 82% of males under the age of 60 years who have prostate cancer will test negative, and 2% of those who do not have cancer will test positive.
   (a) What is the sensitivity of the PSA test for detecting prostate cancer? What is its specificity?
   (b) In a population where the prevalence of prostate cancer is 1 in 10,000, what is the predictive value of a positive test? What is the predictive value of a negative test? Interpret these values.
   (c) Using likelihood ratios, evaluate which result provides more information about an individual patient, a positive PSA test or a negative test.
8. The table below displays data taken from a study comparing self-reported smoking status with measured serum cotinine level. As part of the study, cotinine level was used as a diagnostic tool for predicting smoking status; the self-reported status was considered to be true. For a number of different cutoff points, the observed sensitivities and specificities are listed below.

| Cotinine Level (ng/ml) | Sensitivity | Specificity |
|---|---|---|
| 5 | 0.971 | 0.898 |
| 7 | 0.964 | 0.931 |
| 9 | 0.960 | 0.946 |
| 11 | 0.954 | 0.951 |
| 13 | 0.950 | 0.954 |
| 14 | 0.949 | 0.956 |
| 15 | 0.945 | 0.960 |
| 17 | 0.939 | 0.963 |
| 19 | 0.932 | 0.965 |

   (a) As the cutoff point is raised, how does the sensitivity of the test change? How does the specificity change?
   (b) As the cutoff point is raised, how does the probability of a false positive result change? How does the probability of a false negative result change?
   (c) Use these data to construct an ROC curve.
   (d) Based on the graph, what value of serum cotinine level would you choose as an optimal cutoff point for predicting smoking status? Why?
   (e) If you want the probability of a false positive test result to be no higher than 4%, what is the sensitivity that could be achieved?
9. A study was conducted investigating the use of fasting capillary glycemia (FCG) – the level of glucose in the blood for individuals who have not eaten in a specified number of hours – as a screening test for diabetes. FCG cutoff points ranging from 3.9 to 8.9 mmol/liter were examined; the sensitivities and specificities of the test corresponding to these different levels are contained in the dataset diabetes. The levels of FCG are saved under the variable name fcg, the sensitivities under sensitivity, and the specificities under specificity.
   (a) How does the sensitivity of the screening test change as the cutoff point is raised from 3.9 to 8.9 mmol/liter? How does the specificity change?
   (b) Use these data to construct an ROC curve for FCG.
   (c) The investigators who conducted this study chose an FCG level of 5.6 mmol/liter as the optimal cutoff for predicting diabetes. Do you agree with this choice? Why or why not?
