Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (Wiley, 2001)
Contents

1 Introduction
  1.1 Machine Perception
  1.2 An Example
    1.2.1 Related fields
  1.3 The Sub-problems of Pattern Classification
    1.3.1 Feature Extraction
    1.3.2 Noise
    1.3.3 Overfitting
    1.3.4 Model Selection
    1.3.5 Prior Knowledge
    1.3.6 Missing Features
    1.3.7 Mereology
    1.3.8 Segmentation
    1.3.9 Context
    1.3.10 Invariances
    1.3.11 Evidence Pooling
    1.3.12 Costs and Risks
    1.3.13 Computational Complexity
  1.4 Learning and Adaptation
    1.4.1 Supervised Learning
    1.4.2 Unsupervised Learning
    1.4.3 Reinforcement Learning
  1.5 Conclusion
  Summary by Chapters
  Bibliographical and Historical Remarks
  Bibliography
  Index

Chapter 1
Introduction

The ease with which we recognize a face, understand spoken words, read handwritten characters, identify our car keys in our pocket by feel, and decide whether an apple is ripe by its smell belies the astoundingly complex processes that underlie these acts of pattern recognition. Pattern recognition — the act of taking in raw data and taking an action based on the “category” of the pattern — has been crucial for our survival, and over the past tens of millions of years we have evolved highly sophisticated neural and cognitive systems for such tasks.

1.1 Machine Perception

It is natural that we should seek to design and build machines that can recognize patterns. From automated speech recognition, fingerprint identification, optical character recognition, DNA sequence identification and much more, it is clear that reliable, accurate pattern recognition by machine would be immensely useful. Moreover, in solving the myriad problems required to build such systems, we gain deeper understanding and appreciation for pattern recognition systems in the natural world — most particularly in humans. For some applications, such as speech and visual recognition, our design efforts may in fact be influenced by knowledge of how these are solved in nature, both in the algorithms we employ and in the design of special purpose hardware.

1.2 An Example

To illustrate the complexity of some of the types of problems involved, let us consider the following imaginary and somewhat fanciful example. Suppose that a fish packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species. As a pilot project it is decided to try to separate sea bass from salmon using optical sensing. We set up a camera, take some sample images and begin to note some physical differences between the two types of fish — length, lightness, width, number and shape of fins, position of the mouth, and so on — and these suggest features to explore for use in our classifier.
We also notice noise or variations in the images — variations in lighting, position of the fish on the conveyor, even “static” due to the electronics of the camera itself.

Given that there truly are differences between the population of sea bass and that of salmon, we view them as having different models — different descriptions, which are typically mathematical in form. The overarching goal and approach in pattern classification is to hypothesize the class of these models, process the sensed data to eliminate noise (not due to the models), and for any sensed pattern choose the model that corresponds best. Any techniques that further this aim should be in the conceptual toolbox of the designer of pattern recognition systems.

Our prototype system to perform this very specific task might well have the form shown in Fig. 1.1. First the camera captures an image of the fish. Next, the camera’s signals are preprocessed to simplify subsequent operations without losing relevant information. In particular, we might use a segmentation operation in which the images of different fish are somehow isolated from one another and from the background. The information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain “features” or “properties.” These features (or, more precisely, the values of these features) are then passed to a classifier that evaluates the evidence presented and makes a final decision as to the species. The preprocessor might automatically adjust for average light level, or threshold the image to remove the background of the conveyor belt, and so forth. For the moment let us pass over how the images of the fish might be segmented and consider how the feature extractor and classifier might be designed.

Suppose somebody at the fish plant tells us that a sea bass is generally longer than a salmon. These, then, give us our tentative models for the fish: sea bass have some typical length, and this is greater than that for salmon. Then length becomes an obvious feature, and we might attempt to classify the fish merely by seeing whether or not the length l of a fish exceeds some critical value l*. To choose l* we could obtain some design or training samples of the different types of fish, (somehow) make length measurements, and inspect the results.

Suppose that we do this, and obtain the histograms shown in Fig. 1.2. These disappointing histograms bear out the statement that sea bass are somewhat longer than salmon, on average, but it is clear that this single criterion is quite poor; no matter how we choose l*, we cannot reliably separate sea bass from salmon by length alone.

Discouraged, but undeterred by these unpromising results, we try another feature — the average lightness of the fish scales. Now we are very careful to eliminate variations in illumination, since they can only obscure the models and corrupt our new classifier. The resulting histograms, shown in Fig. 1.3, are much more satisfactory — the classes are much better separated.

So far we have tacitly assumed that the consequences of our actions are equally costly: deciding the fish was a sea bass when in fact it was a salmon was just as undesirable as the converse. Such a symmetry in the cost is often, but not invariably, the case.
For instance, as a fish packing company we may know that our customers easily accept occasional pieces of tasty salmon in their cans labeled “sea bass,” but they object vigorously if a piece of sea bass appears in their cans labeled “salmon.” If we want to stay in business, we should adjust our decision boundary to avoid antagonizing our customers, even if it means that more salmon makes its way into the cans of sea bass. In this case, then, we should move our decision boundary x* to smaller values of lightness, thereby reducing the number of sea bass that are classified as salmon (Fig. 1.3). The more our customers object to getting sea bass with their salmon — i.e., the more costly this type of error — the lower we should set the decision threshold x* in Fig. 1.3.

Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed, then the features extracted and finally the classification emitted (here either “salmon” or “sea bass”). Although the information flow is often chosen to be from the source to the classifier (“bottom-up”), some systems employ “top-down” flow as well, in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction.

Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield.

Even if we know the costs associated with our decisions and choose the optimal decision boundary x*, we may be dissatisfied with the resulting performance. Our first impulse might be to seek yet a different feature on which to separate the fish. Let us assume, though, that no other single visual feature yields better performance than that based on lightness. To improve recognition, then, we must resort to the use of more than one feature at a time.

Figure 1.2: Histograms for the length feature for the two categories. No single threshold value l* (decision boundary) will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value l* marked will lead to the smallest number of errors, on average.

Figure 1.3: Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average.

Figure 1.4: The two features of lightness and width for sea bass and salmon. The dark line might serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors.

In our search for other features, we might try to capitalize on the observation that sea bass are typically wider than salmon.
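The procedure just described, choosing the threshold that minimizes training error and then shifting it when the two kinds of error carry unequal costs, can be made concrete with a short sketch. This is not code from the text: the function, the sample lightness values, and the cost figures below are all invented for illustration.

```python
# Hypothetical illustration of choosing a decision threshold from training
# measurements, as in Figs. 1.2 and 1.3.  All data values are made up.

def best_threshold(salmon, sea_bass, cost_bass_as_salmon=1.0, cost_salmon_as_bass=1.0):
    """Scan candidate thresholds t; classify a fish as sea bass when its
    measurement exceeds t, and as salmon otherwise.  Return the t with the
    lowest total (cost-weighted) error on the training samples."""
    best_t, best_cost = None, float("inf")
    for t in sorted(salmon + sea_bass):
        # salmon misclassified as sea bass: measurement above the threshold
        cost = cost_salmon_as_bass * sum(1 for x in salmon if x > t)
        # sea bass misclassified as salmon: measurement at or below it
        cost += cost_bass_as_salmon * sum(1 for x in sea_bass if x <= t)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

salmon_lightness  = [2.1, 3.0, 3.4, 3.9, 4.4, 5.0]   # invented sample values
seabass_lightness = [4.1, 5.2, 5.8, 6.3, 7.0, 8.2]

x_star = best_threshold(salmon_lightness, seabass_lightness)
# Penalizing sea-bass-in-salmon-cans more heavily pushes x* lower:
x_star_shifted = best_threshold(salmon_lightness, seabass_lightness,
                                cost_bass_as_salmon=5.0)
print(x_star, x_star_shifted)
```

Raising the cost of sending a sea bass out in a salmon can moves the selected threshold to smaller lightness values, exactly the shift of x* described above.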
Now we have two features for classifying fish — the lightness x1 and the width x2. If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x in a two-dimensional feature space, where

    x = (x1, x2)^T.

Our problem now is to partition the feature space into two regions, where for all patterns in one region we will call the fish a sea bass, and for all points in the other we will call it a salmon. Suppose that we measure the feature vectors for our samples and obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule for separating the fish: Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise.
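The rule just stated amounts to checking the sign of a linear function of the two features. The minimal sketch below assumes hypothetical weights and bias; in practice these would be fit to the training samples of Fig. 1.4.

```python
import numpy as np

# A sketch of the linear decision rule of Fig. 1.4.  The weight vector w,
# the bias b, and the example fish are invented for illustration only.
w = np.array([0.6, 1.0])   # weights for (lightness x1, width x2)
b = -22.0                  # offset of the decision line w . x + b = 0

def classify(x):
    """Return 'sea bass' if the feature vector lies on the sea-bass side
    of the decision boundary, and 'salmon' otherwise."""
    return "sea bass" if np.dot(w, x) + b > 0 else "salmon"

print(classify(np.array([7.5, 18.0])))  # a light, wide fish   -> sea bass
print(classify(np.array([3.0, 16.0])))  # a dark, narrower fish -> salmon
```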
This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (as expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance, if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost in attaining more features is of no concern, might we ever have too many features?

Suppose that other features are too difficult or expensive to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line. In that case all the training patterns would be separated perfectly, as shown in Fig. 1.5.

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision rule may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass.

With such a “solution,” though, our satisfaction would be premature because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be “tuned” to the particular training samples, rather than to some underlying characteristics or true model of all the sea bass and salmon that will have to be separated.

Naturally, one approach would be to get more training samples for obtaining a better estimate of the true underlying characteristics, for instance the probability distributions of the categories. In most pattern recognition problems, however, the amount of such data we can obtain easily is often quite limited. Even with a vast amount of training data in a continuous feature space, though, if we followed the approach in Fig. 1.5 our classifier would give a horrendously complicated decision boundary — one that would be unlikely to do well on novel patterns.

Rather, then, we might seek to “simplify” the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns.* But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers? How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition.

* The philosophical underpinnings of this approach derive from William of Occam (1284–1347?), who advocated favoring simpler explanations over those that are needlessly complicated — Entia non sunt multiplicanda praeter necessitatem (“Entities are not to be multiplied without necessity”). Decisions based on overly complex models often lead to lower accuracy of the classifier.

Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier.

For the same incoming patterns, we might need to use a drastically different cost function, and this will lead to different actions altogether. We might, for instance, wish instead to separate the fish based on their sex — all females (of either species) from all males — if we wish to sell roe. Alternatively, we might wish to cull the damaged fish (to prepare separately for cat food), and so on. Different decision tasks may require features and yield boundaries quite different from those useful for our original categorization problem.

This makes it quite clear that our decisions are fundamentally task or cost specific, and that creating a single general purpose artificial pattern recognition device — i.e., one capable of acting accurately based on a wide variety of tasks — is a profoundly difficult challenge. This, too, should give us added appreciation of the ability of humans to switch rapidly and fluidly between pattern recognition tasks.

Since classification is, at base, the task of recovering the model that generated the patterns, different classification techniques are useful depending on the type of candidate models themselves. In statistical pattern recognition we focus on the statistical properties of the patterns (generally expressed in probability densities), and this will command most of our attention in this book. Here the model for a pattern may be a single specific set of features, though the actual pattern sensed has been corrupted by some form of random noise. Occasionally it is claimed that neural pattern recognition (or neural network pattern classification) should be considered its own discipline, but despite its somewhat different intellectual pedigree, we will consider it a close descendant of statistical pattern recognition, for reasons that will become clear.
If instead the model consists of some set of crisp logical rules, then we employ the methods of syntactic pattern recognition, where rules or grammars describe our decision. For example, we might wish to classify an English sentence as grammatical or not, and here statistical descriptions (word frequencies, word correlations, etc.) are inappropriate.

It was necessary in our fish example to choose our features carefully, and hence achieve a representation (as in Fig. 1.6) that enabled reasonably successful pattern classification. A central aspect in virtually every pattern recognition problem is that of achieving such a “good” representation, one in which the structural relationships among the components are simply and naturally revealed, and one in which the true (unknown) model of the patterns can be expressed. In some cases patterns should be represented as vectors of real-valued numbers, in others ordered lists of attributes, in yet others descriptions of parts and their relations, and so forth. We seek a representation in which the patterns that lead to the same action are somehow “close” to one another, yet “far” from those that demand a different action. The extent to which we create or learn a proper representation and how we quantify near and far apart will determine the success of our pattern classifier. A number of additional characteristics are desirable for the representation. We might wish to favor a small number of features, which might lead to simpler decision regions and a classifier easier to train. We might also wish to have features that are robust, i.e., relatively insensitive to noise or other errors. In practical applications we may need the classifier to act quickly, or to use few electronic components, memory or processing steps.

A central technique, when we have insufficient training data, is to incorporate knowledge of the problem domain. Indeed, the less the training data, the more important is such knowledge, for instance of how the patterns themselves were produced. One method that takes this notion to its logical extreme is that of analysis by synthesis, where in the ideal case one has a model of how each pattern is generated. Consider speech recognition. Amidst the manifest acoustic variability among the possible “dee”s that might be uttered by different people, one thing they have in common is that they were all produced by lowering the jaw slightly, opening the mouth, placing the tongue tip against the roof of the mouth after a certain delay, and so on. We might assume that “all” the acoustic variation is due to the happenstance of whether the talker is male or female, old or young, with different overall pitches, and so forth. At some deep level, such a “physiological” model (or so-called “motor” model) for production of the utterances is appropriate, and different (say) from that for “doo” and indeed all other utterances. If this underlying model of production can be determined from the sound (and that is a very big if), then we can classify the utterance by how it was produced. That is to say, the production representation may be the “best” representation for classification. Our pattern recognition systems should then analyze (and hence classify) the input pattern based on how one would have to synthesize that pattern. The trick is, of course, to recover the generating parameters from the sensed pattern.
Consider the difficulty in making a recognizer of all types of chairs — standard office chair, contemporary living room chair, beanbag chair, and so forth — based on an image. Given the astounding variety in the number of legs, material, shape, and so on, we might despair of ever finding a representation that reveals the unity within the class of chair. Perhaps the only such unifying aspect of chairs is functional: a chair is a stable artifact that supports a human sitter, including back support. Thus we might try to deduce such functional properties from the image; the property “can support a human sitter” is only very indirectly related to the orientation of the larger surfaces, and would need to be answered in the affirmative even for a beanbag chair. Of course, this requires some reasoning about the properties and naturally touches upon computer vision rather than pattern recognition proper.

Without going to such extremes, many real world pattern recognition systems seek to incorporate at least some knowledge about the method of production of the patterns or their functional use in order to insure a good representation, though of course the goal of the representation is classification, not reproduction. For instance, in optical character recognition (OCR) one might confidently assume that handwritten characters are written as a sequence of strokes, and first try to recover a stroke representation from the sensed image, and then deduce the character from the identified strokes.

1.2.1 Related fields

Pattern classification differs from classical statistical hypothesis testing, wherein the sensed data are used to decide whether or not to reject a null hypothesis in favor of some alternative hypothesis. Roughly speaking, if the probability of obtaining the data given some null hypothesis falls below a “significance” threshold, we reject the null hypothesis in favor of the alternative. For typical values of this criterion, there is a strong bias or predilection in favor of the null hypothesis; even though the alternative hypothesis may be more probable, we might not be able to reject the null hypothesis. Hypothesis testing is often used to determine whether a drug is effective, where the null hypothesis is that it has no effect. Hypothesis testing might be used to determine whether the fish on the conveyor belt belong to a single class (the null hypothesis) or come from two classes (the alternative). In contrast, given some data, pattern classification seeks to find the most probable hypothesis from a set of hypotheses — “this fish is probably a salmon.”

Pattern classification differs, too, from image processing. In image processing, the input is an image and the output is an image. Image processing steps often include rotation, contrast enhancement, and other transformations which preserve all the original information. Feature extraction, such as finding the peaks and valleys of the intensity, loses information (but hopefully preserves everything relevant to the task at hand). As just described, feature extraction takes in a pattern and produces feature values. The number of features is virtually always chosen to be fewer than the total necessary to describe the complete target of interest, and this leads to a loss in information. In acts of associative memory, the system takes in a pattern and emits another pattern which is representative of a general group of patterns.
It thus reduces the information somewhat, but rarely to the extent that pattern classification does. In short, because of the crucial role of a decision in pattern recognition, it is fundamentally an information reduction process. The classification step represents an even more radical loss of information, reducing the original several thousand bits representing all the color of each of several thousand pixels down to just a few bits representing the chosen category (a single bit in our fish example).

1.3 The Sub-problems of Pattern Classification

We have alluded to some of the issues in pattern classification and we now turn to a more explicit list of them. In practice, these typically require the bulk of the research and development effort. Many are domain or problem specific, and their solution will depend upon the knowledge and insights of the designer. Nevertheless, a few are of sufficient generality, difficulty, and interest that they warrant explicit consideration.

1.3.1 Feature Extraction

The conceptual boundary between feature extraction and classification proper is somewhat arbitrary: an ideal feature extractor would yield a representation that makes the job of the classifier trivial; conversely, an omnipotent classifier would not need the help of a sophisticated feature extractor. The distinction is forced upon us for practical, rather than theoretical, reasons. Generally speaking, the task of feature extraction is much more problem and domain dependent than is classification proper, and thus requires knowledge of the domain. A good feature extractor for sorting fish would surely be of little use for identifying fingerprints, or for classifying photomicrographs of blood cells. How do we know which features are most promising? Are there ways to automatically learn which features are best for the classifier? How many shall we use?

1.3.2 Noise

The lighting of the fish may vary, there could be shadows cast by neighboring equipment, the conveyor belt might shake — all reducing the reliability of the feature values actually measured. We define noise in very general terms: any property of the sensed pattern due not to the true underlying model but instead to randomness in the world or the sensors. All non-trivial decision and pattern recognition problems involve noise in some form. In some cases it is due to the transduction in the signal, and we may consign to our preprocessor the role of cleaning up the signal, as for instance the visual noise in our video camera viewing the fish. An important problem is knowing somehow whether the variation in some signal is noise or is instead due to complex underlying models of the fish. How then can we use this information to improve our classifier?

1.3.3 Overfitting

In going from Fig. 1.4 to Fig. 1.5 in our fish classification problem, we were, implicitly, using a more complex model of sea bass and of salmon. That is, we were adjusting the complexity of our classifier. While an overly complex model may allow perfect classification of the training samples, it is unlikely to give good classification of novel patterns — a situation known as overfitting. One of the most important areas of research in statistical pattern classification is determining how to adjust the complexity of the model — not so simple that it cannot explain the differences between the categories, yet not so complex as to give poor classification on novel patterns. Are there principled methods for finding the best (intermediate) complexity for a classifier?
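One common, though by no means the only, principled approach is to hold out some labeled samples and prefer the complexity that performs best on them. The toy sketch below illustrates the idea with invented one-dimensional data; past some degree, training error keeps falling while held-out error does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 1-D data: two overlapping classes, labels -1 (salmon), +1 (sea bass).
x = np.concatenate([rng.normal(3.5, 1.0, 60), rng.normal(6.0, 1.0, 60)])
y = np.concatenate([-np.ones(60), np.ones(60)])

# Hold out part of the data; fit on the rest.
idx = rng.permutation(len(x))
train, val = idx[:80], idx[80:]

def error(coeffs, xs, ys):
    """Fraction of samples whose sign(polynomial) disagrees with the label."""
    return np.mean(np.sign(np.polyval(coeffs, xs)) != ys)

# Higher-degree polynomial discriminants can fit the training set better
# and better, but the held-out error eventually stops improving or worsens.
for degree in (1, 3, 9):
    c = np.polyfit(x[train], y[train], degree)
    print(f"degree {degree}: train {error(c, x[train], y[train]):.2f}, "
          f"held-out {error(c, x[val], y[val]):.2f}")
```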
1.3.4 Model Selection

We might have been unsatisfied with the performance of our fish classifier in Figs. 1.4 & 1.5, and thus jumped to an entirely different class of model, for instance one based on some function of the number and position of the fins, the color of the eyes, the weight, the shape of the mouth, and so on. How do we know when a hypothesized model differs significantly from the true model underlying our patterns, and thus that a new model is needed? In short, how are we to know to reject a class of models and try another one? Are we as designers reduced to random and tedious trial and error in model selection, never really knowing whether we can expect improved performance? Or might there be principled methods for knowing when to jettison one class of models and invoke another? Can we automate the process?

1.3.5 Prior Knowledge

In one limited sense, we have already seen how prior knowledge — about the lightness of the different fish categories — helped in the design of a classifier by suggesting a promising feature. Incorporating prior knowledge can be far more subtle and difficult. In some applications the knowledge ultimately derives from information about the production of the patterns, as we saw in analysis-by-synthesis. In others the knowledge may be about the form of the underlying categories, or specific attributes of the patterns, such as the fact that a face has two eyes, one nose, and so on.

1.3.6 Missing Features

Suppose that during classification, the value of one of the features cannot be determined, for example the width of the fish because of occlusion by another fish (i.e., the other fish is in the way). How should the categorizer compensate? Since our two-feature recognizer never had a single-variable threshold value x* determined in anticipation of the possible absence of a feature (cf. Fig. 1.3), how shall it make the best decision using only the feature present? The naive method, of merely assuming that the value of the missing feature is zero or the average of the values for the training patterns, is provably non-optimal. Likewise, we occasionally have missing features during the creation or learning of our recognizer. How should we train a classifier or use one when some features are missing?
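Chapter 2 (Sect. 2.10) treats missing features properly. As a foretaste, here is a minimal sketch, with invented Gaussian class-conditional parameters, of deciding from the feature that is actually present rather than substituting a made-up value for the absent one.

```python
import math

# A sketch (not the book's method) of deciding with a missing feature.
# Gaussian class-conditional densities with invented (mean, std) parameters:
# feature 1 = lightness, feature 2 = width (the one that is occluded).
params = {
    "salmon":   {"lightness": (3.5, 1.0), "width": (16.0, 1.5)},
    "sea bass": {"lightness": (6.0, 1.0), "width": (19.0, 1.5)},
}

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide_lightness_only(lightness):
    """With width unobserved, score each class by the class-conditional
    density of the feature actually present (equal priors assumed).  If the
    features are independent, this is what marginalizing the joint density
    over the missing width leaves behind."""
    scores = {c: gauss(lightness, *p["lightness"]) for c, p in params.items()}
    return max(scores, key=scores.get)

print(decide_lightness_only(4.0))   # -> salmon
```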
1.3.7 Mereology

We effortlessly read a simple word such as BEATS. But consider this: Why didn’t we read instead other words that are perfectly good subsets of the full pattern, such as BE, BEAT, EAT, AT, and EATS? Why don’t they enter our minds, unless explicitly brought to our attention? Or when we saw the B, why didn’t we read a P or an I, which are “there” within the B? Conversely, how is it that we can read the two unsegmented words in POLOPONY without placing the entire input into a single word category? This is the problem of subsets and supersets — formally part of mereology, the study of part/whole relationships. It is closely related to that of prior knowledge and segmentation. In short, how do we recognize or group together the “proper” number of elements — neither too few nor too many? It appears as though the best classifiers try to incorporate as much of the input into the categorization as “makes sense,” but not too much. How can this be done?

1.3.8 Segmentation

In our fish example, we have tacitly assumed that the fish were isolated, separate on the conveyor belt. In practice, they would often be abutting or overlapping, and our system would have to determine where one fish ends and the next begins — the individual patterns have to be segmented. If we have already recognized the fish then it would be easier to segment them. But how can we segment the images before they have been categorized, or categorize them before they have been segmented? It seems we need a way to know when we have switched from one model to another, or to know when we just have background or “no category.” How can this be done?

Segmentation is one of the deepest problems in automated speech recognition. We might seek to recognize the individual sounds (e.g., phonemes, such as “ss,” “k,” ...), and then put them together to determine the word. But consider two nonsense words, “sklee” and “skloo.” Speak them aloud and notice that for “skloo” you push your lips forward (so-called “rounding,” in anticipation of the upcoming “oo”) before you utter the “ss.” Such rounding influences the sound of the “ss,” lowering the frequency spectrum compared to the “ss” sound in “sklee” — a phenomenon known as anticipatory coarticulation. Thus, the “oo” phoneme reveals its presence in the “ss” earlier than the “k” and “l” which nominally occur before the “oo” itself! How do we segment the “oo” phoneme from the others when they are so manifestly intermingled? Or should we even try? Perhaps we are focusing on groupings of the wrong size, and the most useful unit for recognition is somewhat larger, as we saw in subsets and supersets (Sect. 1.3.7). A related problem occurs in connected cursive handwritten character recognition: How do we know where one character “ends” and the next one “begins”?

1.3.9 Context

We might be able to use context — input-dependent information other than from the target pattern itself — to improve our recognizer. For instance, it might be known for our fish packing plant that if we are getting a sequence of salmon, it is highly likely that the next fish will be a salmon (since it probably comes from a boat that just returned from a fishing area rich in salmon). Thus, if after a long series of salmon our recognizer detects an ambiguous pattern (i.e., one very close to the nominal decision boundary), it may nevertheless be best to categorize it too as a salmon. We shall see how such a simple correlation among patterns — the most elementary form of context — might be used to improve recognition. But how, precisely, should we incorporate such information?

Context can be highly complex and abstract. The utterance “jeetyet?” may seem nonsensical, unless you hear it spoken by a friend in the context of the cafeteria at lunchtime — “did you eat yet?” How can such a visual and temporal context influence your speech recognition?
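The salmon-run example above can be sketched directly: let recent history raise the prior for salmon, and let that adjusted prior tip an otherwise ambiguous decision. The blending rule, the weights, and the likelihood values below are invented for illustration.

```python
# A sketch of the simplest use of context described above.  Nothing here
# is from the book; the update rule and all numbers are hypothetical.

def updated_salmon_prior(recent, base_prior=0.5, weight=0.4):
    """Blend the base prior with the frequency of salmon among recent fish."""
    if not recent:
        return base_prior
    freq = recent.count("salmon") / len(recent)
    return (1 - weight) * base_prior + weight * freq

def decide(likelihood_salmon, likelihood_bass, prior_salmon):
    """Choose the class with the larger likelihood-times-prior score."""
    score_s = likelihood_salmon * prior_salmon
    score_b = likelihood_bass * (1 - prior_salmon)
    return "salmon" if score_s > score_b else "sea bass"

history = ["salmon"] * 9 + ["sea bass"]
prior = updated_salmon_prior(history)   # 0.6*0.5 + 0.4*0.9 = 0.66

# An ambiguous fish: nearly equally likely under either class.
print(decide(0.48, 0.52, 0.5))    # without context -> sea bass
print(decide(0.48, 0.52, prior))  # with context    -> salmon
```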
1.3.10 Invariances

In seeking to achieve an optimal representation for a particular pattern classification task, we confront the problem of invariances. In our fish example, the absolute position on the conveyor belt is irrelevant to the category, and thus our representation should also be insensitive to the absolute position of the fish. Here we seek a representation that is invariant to the transformation of translation (in either the horizontal or vertical direction). Likewise, in a speech recognition problem, it might be required only that we be able to distinguish between utterances regardless of the particular moment they were uttered; here the “translation” invariance we must ensure is in time.

The “model parameters” describing the orientation of our fish on the conveyor belt are horrendously complicated — due as they are to the sloshing of water, the bumping of neighboring fish, the shape of the fish net, etc. — and thus we give up hope of ever trying to use them. These parameters are irrelevant to the model parameters that interest us anyway, i.e., the ones associated with the differences between the fish categories. Thus here we try to build a classifier that is invariant to transformations such as rotation.

The orientation of the fish on the conveyor belt is irrelevant to its category. Here the transformation of concern is a two-dimensional rotation about the camera’s line of sight. A more general invariance would be for rotations about an arbitrary line in three dimensions. The image of even such a “simple” object as a coffee cup undergoes radical variation as the cup is rotated to an arbitrary angle — the handle may become hidden, the bottom of the inside volume come into view, the circular lip appear oval or a straight line or even become obscured, and so forth. How might we insure that our pattern recognizer is invariant to such complex changes?

The overall size of an image may be irrelevant for categorization. Such differences might be due to variation in the range to the object; alternatively, we may be genuinely unconcerned with differences between sizes — a young, small salmon is still a salmon.

For patterns that have inherent temporal variation, we may want our recognizer to be insensitive to the rate at which the pattern evolves. Thus a slow hand wave and a fast hand wave may be considered equivalent. Rate variation is a deep problem in speech recognition, of course; not only do different individuals talk at different rates, but even a single talker may vary in rate, causing the speech signal to change in complex ways. Likewise, cursive handwriting varies in complex ways as the writer speeds up — the placement of dots on the i’s, and cross bars on the t’s and f’s, are the first casualties of rate increase, while the appearance of l’s and e’s are relatively inviolate. How can we make a recognizer that changes its representations for some categories differently from that for others under such rate variation?

A large number of highly complex transformations arise in pattern recognition, and many are domain specific. We might wish to make our handwritten optical character recognizer insensitive to the overall thickness of the pen line, for instance. Far more severe are transformations such as the non-rigid deformations that arise in three-dimensional object recognition, such as the radical variation in the image of your hand as you grasp an object or snap your fingers. Similarly, variations in illumination or the complex effects of cast shadows may need to be taken into account.

The symmetries just described are continuous — the pattern can be translated, rotated, sped up, or deformed by an arbitrary amount. In some pattern recognition applications other — discrete — symmetries are relevant, such as flips left-to-right or top-to-bottom.

In all of these invariances the problem arises: How do we determine whether an invariance is present? How do we efficiently incorporate such knowledge into our recognizer?
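One simple way to build translation and size invariance into the representation, rather than into the classifier, is to normalize each pattern before measuring features. The sketch below does this for an invented set of outline points; rotation and the other invariances above are considerably harder.

```python
import numpy as np

# A sketch of normalizing away position and scale before feature
# extraction.  The "fish outline" points are invented for illustration.

def normalize(points):
    """Center the points on their centroid and rescale to unit average
    radius: the result is unchanged by translating or uniformly scaling
    the original pattern."""
    pts = np.asarray(points, dtype=float)
    pts = pts - pts.mean(axis=0)                    # translation invariance
    scale = np.sqrt((pts ** 2).sum(axis=1)).mean()  # average distance to centroid
    return pts / scale                              # size invariance

outline = np.array([[0.0, 0.0], [4.0, 1.0], [8.0, 0.0], [4.0, -1.0]])
shifted_and_bigger = 2.5 * outline + np.array([30.0, 7.0])
print(np.allclose(normalize(outline), normalize(shifted_and_bigger)))  # True
```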
1.3.11 Evidence Pooling

In our fish example we saw how using multiple features could lead to improved recognition. We might imagine that we could do better if we had several component classifiers. If these categorizers agree on a particular pattern, there is no difficulty. But suppose they disagree. How should a “super” classifier pool the evidence from the component recognizers to achieve the best decision?

Imagine calling in ten experts to determine whether a particular fish is diseased or not. While nine agree that the fish is healthy, one expert does not. Who is right? It may be that the lone dissenter is the only one familiar with the particular very rare symptoms in the fish, and is in fact correct. How would the “super” categorizer know when to base a decision on a minority opinion, even from an expert in one small domain who is not well qualified to judge throughout a broad range of problems?
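Here is a minimal sketch of one pooling scheme, a weighted vote in which each component classifier's vote counts in proportion to an assumed reliability weight. The weights and votes are invented; note how a single heavily weighted dissenter, like the rare-disease expert above, can override the majority.

```python
# A sketch of a "super" classifier pooling component decisions by weighted
# vote.  The reliability weights and the votes are invented for illustration.

def pool(votes):
    """votes: list of (label, weight) pairs; return the label whose total
    weight is largest."""
    totals = {}
    for label, weight in votes:
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

votes = [("healthy", 1.0)] * 9 + [("diseased", 10.0)]  # the rare-disease expert
print(pool(votes))  # -> diseased
```

How to choose such weights, and when a minority opinion should in fact prevail, is exactly the open question posed above.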
1.3.12 Costs and Risks

We should realize that a classifier rarely exists in a vacuum. Instead, it is generally to be used to recommend actions (put this fish in this bucket, put that fish in that bucket), each action having an associated cost or risk. Conceptually, the simplest such risk is the classification error: what percentage of new patterns are assigned to the wrong category. However, the notion of risk is far more general, as we shall see. We often design our classifier to recommend actions that minimize some total expected cost or risk. Thus, in some sense, the notion of category itself derives from the cost or task. How do we incorporate knowledge about such risks, and how will they affect our classification decision?

Finally, can we estimate the total risk and thus tell whether our classifier is acceptable even before we field it? Can we estimate the lowest possible risk of any classifier, to see how close ours comes to this ideal, or whether the problem is simply too hard overall?

1.3.13 Computational Complexity

Some pattern recognition problems can be solved using algorithms that are highly impractical. For instance, we might try to hand label all possible 20 × 20 binary pixel images with a category label for optical character recognition, and use table lookup to classify incoming patterns. Although we might achieve error-free recognition, the labeling time and storage requirements would be quite prohibitive, since it would require labeling each of 2^(20×20) = 2^400 ≈ 10^120 patterns. Thus the computational complexity of different algorithms is of importance, especially for practical applications.

In more general terms, we may ask how an algorithm scales as a function of the number of feature dimensions, the number of patterns, or the number of categories. What is the tradeoff between computational ease and performance? In some problems we know we can design an excellent recognizer, but not within the engineering constraints. How can we optimize within such constraints? We are typically less concerned with the complexity of learning, which is done in the laboratory, than with the complexity of making a decision, which is done with the fielded application. While computational complexity generally correlates with the complexity of the hypothesized model of the patterns, these two notions are conceptually different.

This section has catalogued some of the central problems in classification. It has been found that the most effective methods for developing classifiers involve learning from examples, i.e., from a set of patterns whose category is known. Throughout this book, we shall see again and again how methods of learning relate to these central problems, and are essential in the building of classifiers.

1.4 Learning and Adaptation

In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Because nearly all practical or interesting pattern recognition problems are so hard that we cannot guess the best classification decision ahead of time, we shall spend the great majority of our time here considering learning. Creating classifiers then involves positing some general form of model, or form of the classifier, and using training patterns to learn or estimate the unknown parameters of the model. Learning refers to some form of algorithm for reducing the error on a set of training data. A range of gradient descent algorithms that alter a classifier’s parameters in order to reduce an error measure now permeate the field of statistical pattern recognition, and these will demand a great deal of our attention. Learning comes in several general forms.

1.4.1 Supervised Learning

In supervised learning, a teacher provides a category label or cost for each pattern in a training set, and we seek to reduce the sum of the costs for these patterns. How can we be sure that a particular learning algorithm is powerful enough to learn the solution to a given problem and that it will be stable to parameter variations? How can we determine whether it will converge in finite time, or scale reasonably with the number of training patterns, the number of input features or the perplexity of the problem? How can we insure that the learning algorithm appropriately favors “simple” solutions (as in Fig. 1.6) rather than complicated ones (as in Fig. 1.5)?

1.4.2 Unsupervised Learning

In unsupervised learning or clustering there is no explicit teacher, and the system forms clusters or “natural groupings” of the input patterns. “Natural” is always defined explicitly or implicitly in the clustering system itself, and given a particular set of patterns or cost function, different clustering algorithms lead to different clusters. Often the user will set the hypothesized number of different clusters ahead of time, but how should this be done? How do we avoid inappropriate representations?

1.4.3 Reinforcement Learning

The most typical way to train a classifier is to present an input, compute its tentative category label, and use the known target category label to improve the classifier. For instance, in optical character recognition, the input might be an image of a character, the actual output of the classifier the category label “R,” and the desired output a “B.” In reinforcement learning, or learning with a critic, no desired category signal is given; instead, the only teaching feedback is that the tentative category is right or wrong. This is analogous to a critic who merely states that something is right or wrong, but does not say specifically how it is wrong. (Thus only binary feedback is given to the classifier; reinforcement learning also describes the case where a single scalar signal, say some number between 0 and 1, is given by the teacher.) In pattern classification, it is most common that such reinforcement is binary — either the tentative decision is correct or it is not. (Of course, if our problem involves just two categories and equal costs for errors, then learning with a critic is equivalent to standard supervised learning.) How can the system learn which aspects are important from such non-specific feedback?
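The view of learning just described in Sect. 1.4, gradient descent on a training error measure, can be illustrated with a toy supervised problem: fitting the threshold of a soft classifier to labeled lengths. Everything below, the data, the model, the learning rate, is invented for illustration.

```python
import math

# Learning as gradient descent on training error: fit the threshold t of a
# soft classifier s(x) = 1 / (1 + exp(-(x - t))) to labeled lengths.

salmon  = [2.0, 3.0, 3.5, 4.0]    # target output 0
seabass = [5.0, 6.0, 6.5, 8.0]    # target output 1
data = [(x, 0.0) for x in salmon] + [(x, 1.0) for x in seabass]

t, lr = 0.0, 0.2
for epoch in range(500):
    grad = 0.0
    for x, y in data:
        s = 1.0 / (1.0 + math.exp(-(x - t)))
        # derivative of the squared error (s - y)^2 with respect to t,
        # using ds/dt = -s(1 - s)
        grad += 2.0 * (s - y) * (-s * (1.0 - s))
    t -= lr * grad                 # step downhill on the training error
print(round(t, 2))                 # ends up between the two classes
```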
1.5 Conclusion

At this point the reader may be overwhelmed by the number, complexity and magnitude of these sub-problems. Further, these sub-problems are rarely addressed in isolation; they are invariably interrelated. Thus, for instance, in seeking to reduce the complexity of our classifier, we might affect its ability to deal with invariance. We point out, though, that the good news is at least three-fold: 1) there is an “existence proof” that many of these problems can indeed be solved — as demonstrated by humans and other biological systems; 2) mathematical theories solving some of these problems have in fact been discovered; and finally 3) there remain many fascinating unsolved problems providing opportunities for progress.

Summary by Chapters

The overall organization of this book is to address first those cases where a great deal of information about the models is known (such as the probability densities, category labels, ...) and to move, chapter by chapter, toward problems where the forms of the distributions are unknown and even the category membership of training patterns is unknown. We begin in Chap. ?? (Bayes decision theory) by considering the ideal case in which the probability structure underlying the categories is known perfectly. While this sort of situation rarely occurs in practice, it permits us to determine the optimal (Bayes) classifier against which we can compare all other methods. Moreover, in some problems it enables us to predict the error we will get when we generalize to novel patterns. In Chap. ?? (Maximum Likelihood and Bayesian Parameter Estimation) we address the case when the full probability structure underlying the categories is not known, but the general forms of their distributions are — i.e., the models. Thus the uncertainty about a probability distribution is represented by the values of some unknown parameters, and we seek to determine these parameters to attain the best categorization. In Chap. ?? (Nonparametric techniques) we move yet further from the Bayesian ideal, and assume that we have no prior parameterized knowledge about the underlying probability structure; in essence our classification will be based on information provided by training samples alone. Classic techniques such as the nearest-neighbor algorithm and potential functions play an important role here.

We then in Chap. ?? (Linear Discriminant Functions) return somewhat toward the general approach of parameter estimation. We shall assume that the so-called “discriminant functions” are of a very particular form — viz., linear — in order to derive a class of incremental training rules. Next, in Chap. ?? (Nonlinear Discriminants and Neural Networks) we see how some of the ideas from such linear discriminants can be extended to a class of very powerful algorithms such as backpropagation and others for multilayer neural networks; these neural techniques have a range of useful properties that have made them a mainstay in contemporary pattern recognition research. In Chap. ?? (Stochastic Methods) we discuss simulated annealing, Boltzmann learning and other stochastic methods. We explore the behavior of such algorithms with regard to the matter of local minima that can plague other neural methods. Chapter ?? (Non-metric Methods) moves beyond models that are statistical in nature to ones that can be best described by (logical) rules.
Here we discuss tree-based algorithms such as CART (which can also be applied to statistical data) and syntactic methods, such as grammar-based ones, which rest on crisp rules. Chapter ?? (Theory of Learning) is both the most important chapter and the most difficult one in the book. Some of the results described there, such as the notion of capacity, degrees of freedom, the relationship between expected error and training set size, and computational complexity, are subtle but nevertheless crucial both theoretically and practically. In some sense, the other chapters can only be fully understood (or used) in light of the results presented here; you cannot expect to solve important pattern classification problems without using the material from this chapter. We conclude in Chap. ?? (Unsupervised Learning and Clustering) by addressing the case when input training patterns are not labeled, and our recognizer must determine the cluster structure. We also treat a related problem, that of learning with a critic, in which the teacher provides only a single bit of information during the presentation of a training pattern — “yes,” that the classification provided by the recognizer is correct, or “no,” it isn’t. Here algorithms for reinforcement learning will be presented.

Bibliographical and Historical Remarks

Classification is among the first crucial steps in making sense of the blooming, buzzing confusion of sensory data that intelligent systems confront. In the western world, the foundations of pattern recognition can be traced to Plato [2], later extended by Aristotle [1], who distinguished between an “essential property” (which would be shared by all members in a class or “natural kind,” as he put it) and an “accidental property” (which could differ among members in the class). Pattern recognition can be cast as the problem of finding such essential properties of a category. It has been a central theme in the discipline of philosophical epistemology, the study of the nature of knowledge. A more modern treatment of some philosophical problems of pattern recognition, relating to the technical matter in the current book, can be found in [22, 4, 18]. In the eastern world, the first Zen patriarch, Bodhidharma [3], would point at things and demand that students answer “What is that?” as a way of confronting the deepest issues in mind, the identity of objects, and the nature of classification and decision. A delightful and particularly insightful book on the foundations of artificial intelligence, including pattern recognition, is [20]. Early technical treatments by Minsky [14] and Rosenfeld [16] are still valuable, as are a number of overviews and reference books [5]. The modern literature on decision theory and pattern recognition is now overwhelming, and comprises dozens of journals, thousands of books and conference proceedings and innumerable articles; it continues to grow rapidly. While some disciplines, such as statistics [7], machine learning [17] and neural networks [8], expand the foundations of pattern recognition, others, such as computer vision [6, 19] and speech recognition [15], rely on it heavily. Perceptual psychology, cognitive science [12], psychobiology [21] and neuroscience [10] analyze how pattern recognition is achieved in humans and other animals. The extreme view that everything in human cognition — including rule-following and logic — can be reduced to pattern recognition is presented in [13].
Pattern recognition techniques have been applied in virtually every scientific and technical discipline.

Bibliography

[1] Aristotle, Robin Waterfield, and David Bostock. Physics. Oxford University Press, Oxford, UK, 1996.
[2] Allan Bloom. The Republic of Plato. Basic Books, New York, NY, 2nd edition, 1991.
[3] Bodhidharma. The Zen Teachings of Bodhidharma. North Point Press, San Francisco, CA, 1989.
[4] Mikhail M. Bongard. Pattern Recognition. Spartan Books, Washington, DC, 1970.
[5] Chi-hau Chen, Louis François Pau, and Patrick S. P. Wang, editors. Handbook of Pattern Recognition & Computer Vision. World Scientific, Singapore, 2nd edition, 1993.
[6] Marty Fischler and Oscar Firschein. Readings in Computer Vision: Issues, Problems, Principles and Paradigms. Morgan Kaufmann, San Mateo, CA, 1987.
[7] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, NY, 2nd edition, 1990.
[8] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991.
[9] Douglas Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, New York, NY, 1979.
[10] Eric R. Kandel and James H. Schwartz. Principles of Neural Science. Elsevier, New York, NY, 2nd edition, 1985.
[11] Immanuel Kant. Critique of Pure Reason. Prometheus Books, New York, NY, 1990.
[12] George F. Luger. Cognitive Science: The Science of Intelligent Systems. Academic Press, New York, NY, 1994.
[13] Howard Margolis. Patterns, Thinking, and Cognition: A Theory of Judgement. University of Chicago Press, Chicago, IL, 1987.
[14] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IEEE, 49:8–30, 1961.
[15] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
[16] Azriel Rosenfeld. Picture Processing by Computer. Academic Press, New York, NY, 1969.
[17] Jude W. Shavlik and Thomas G. Dietterich, editors. Readings in Machine Learning. Morgan Kaufmann, San Mateo, CA, 1990.
[18] Brian Cantwell Smith. On the Origin of Objects. MIT Press, Cambridge, MA, 1996.
[19] Louise Stark and Kevin Bower. Generic Object Recognition using Form & Function. World Scientific, River Edge, NJ, 1996.
[20] Donald R. Tveter. The Pattern Recognition Basis of Artificial Intelligence. IEEE Press, New York, NY, 1998.
[21] William R. Uttal. The Psychobiology of Sensory Coding. HarperCollins, New York, NY, 1973.
[22] Satoshi Watanabe. Knowing and Guessing: A Quantitative Study of Inference and Information. John Wiley, New York, NY, 1969.
Contents

2 Bayesian decision theory
  2.1 Introduction
  2.2 Bayesian Decision Theory – Continuous Features
    2.2.1 Two-Category Classification
  2.3 Minimum-Error-Rate Classification
    2.3.1 *Minimax Criterion
    2.3.2 *Neyman-Pearson Criterion
  2.4 Classifiers, Discriminants and Decision Surfaces
    2.4.1 The Multi-Category Case
    2.4.2 The Two-Category Case
  2.5 The Normal Density
    2.5.1 Univariate Density
    2.5.2 Multivariate Density
  2.6 Discriminant Functions for the Normal Density
    2.6.1 Case 1: Σi = σ²I
    2.6.2 Case 2: Σi = Σ
    2.6.3 Case 3: Σi = arbitrary
    Example 1: Decisions for Gaussian data
  2.7 *Error Probabilities and Integrals
  2.8 *Error Bounds for Normal Densities
2.8.1 Chernoff Bound .............................. 31
2.8.2 Bhattacharyya Bound .......................... 32
Example 2: Error bounds ............................ 33
2.8.3 Signal Detection Theory and Operating Characteristics .. 33
2.9 Bayes Decision Theory — Discrete Features .............. 36
2.9.1 Independent Binary Features .................... 36
Example 3: Bayes decisions for binary data ............... 38
2.10 *Missing and Noisy Features ........................ 39
2.10.1 Missing Features ............................ 39
2.10.2 Noisy Features .............................. 40
2.11 *Compound Bayes Decision Theory and Context .......... 41
Summary ........................................ 42
Bibliographical and Historical Remarks ................... 43
Problems ........................................ 44
Computer exercises ................................. 59
Bibliography ...................................... 61
Index ........................................... 65

Chapter 2

Bayesian decision theory

2.1 Introduction

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions. It makes the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known. In this chapter we develop the fundamentals of this theory, and show how it can be viewed as simply a formalization of common-sense procedures; in subsequent chapters we will consider the problems that arise when the probabilistic structure is not completely known.

While we will give a quite general, abstract development of Bayesian decision theory in Sect. ??, we begin our discussion with a specific example. Let us reconsider the hypothetical problem posed in Chap. ?? of designing a classifier to separate two kinds of fish: sea bass and salmon. Suppose that an observer watching fish arrive along the conveyor belt finds it hard to predict what type will emerge next and that the sequence of types of fish appears to be random. In decision-theoretic terminology we would say that as each fish emerges nature is in one or the other of the two possible states: either the fish is a sea bass or the fish is a salmon. We let ω denote the state of nature, with ω = ω1 for sea bass and ω = ω2 for salmon. Because the state of nature is so unpredictable, we consider ω to be a variable that must be described probabilistically.

If the catch produced as much sea bass as salmon, we would say that the next fish is equally likely to be sea bass or salmon. More generally, we assume that there is some a priori probability (or simply prior) P(ω1) that the next fish is sea bass, and some prior probability P(ω2) that it is salmon. If we assume there are no other types of fish relevant here, then P(ω1) and P(ω2) sum to one. These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or salmon before the fish actually appears. The priors might, for instance, depend upon the time of year or the choice of fishing area.

Suppose for a moment that we were forced to make a decision about the type of fish that will appear next without being allowed to see it. For the moment, we shall
assume that any incorrect classification entails the same cost or consequence, and that the only information we are allowed to use is the value of the prior probabilities. If a decision must be made with so little information, it seems logical to use the following decision rule: Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.

This rule makes sense if we are to judge just one fish, but if we are to judge many fish, using this rule repeatedly may seem a bit strange. After all, we would always make the same decision even though we know that both types of fish will appear. How well it works depends upon the values of the prior probabilities. If P(ω1) is very much greater than P(ω2), our decision in favor of ω1 will be right most of the time. If P(ω1) = P(ω2), we have only a fifty-fifty chance of being right. In general, the probability of error is the smaller of P(ω1) and P(ω2), and we shall see later that under these conditions no other decision rule can yield a larger probability of being right.

In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier. Different fish will yield different lightness readings, and we express this variability in probabilistic terms; we consider x to be a continuous random variable whose distribution depends on the state of nature and is expressed as p(x|ω1).∗ This is the class-conditional probability density function — the probability density function for x given that the state of nature is ω1. (It is also sometimes called the state-conditional probability density.) Then the difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between populations of sea bass and salmon (Fig. 2.1).

Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj). Suppose further that we measure the lightness of a fish and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature — that is, the category of the fish? We note first that the (joint) probability density of finding a pattern that is in category ωj and has feature value x can be written two ways: p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj). Rearranging these leads us to the answer to our question, which is called Bayes' formula:

    P(ωj|x) = p(x|ωj) P(ωj) / p(x),                                  (1)

where in this case of two categories

    p(x) = Σ_{j=1}^{2} p(x|ωj) P(ωj).                                (2)

Bayes' formula can be expressed informally in English by saying that

    posterior = (likelihood × prior) / evidence.                     (3)

∗ We generally use an upper-case P(·) to denote a probability mass function and a lower-case p(·) to denote a probability density function. Strictly speaking, the density p(x|ω1) should be written as pX(x|ω1) to indicate that we are speaking about a particular density function for the random variable X. This more elaborate subscripted notation makes it clear that pX(·) and pY(·) denote two different functions, a fact that is obscured when writing p(x) and p(y). Since this potential confusion rarely arises in practice, we have elected to adopt the simpler notation. Readers who are unsure of our notation or who would like to review probability theory should see Appendix ??.
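To make the arithmetic of Eqs. 1–3 concrete, the following is a minimal sketch in Python. The Gaussian class-conditional densities and their parameters are invented for illustration (the text specifies no particular densities); only the priors P(ω1) = 2/3, P(ω2) = 1/3 and the structure of Bayes' formula come from the discussion above.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density; stands in for a class-conditional p(x|w)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical class-conditional densities for lightness x (parameters invented).
likelihoods = [lambda x: gaussian_pdf(x, 11.0, 1.0),   # p(x|w1), sea bass
               lambda x: gaussian_pdf(x, 13.0, 1.5)]   # p(x|w2), salmon
priors = [2.0 / 3.0, 1.0 / 3.0]                        # P(w1), P(w2)

def posteriors(x):
    """Bayes' formula (Eq. 1): P(wj|x) = p(x|wj) P(wj) / p(x)."""
    joint = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)                              # p(x), Eq. 2
    return [j / evidence for j in joint]

post = posteriors(14.0)
print(post, sum(post))   # the posteriors sum to one, as all good probabilities must
```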
Bayes' formula shows that by observing the value of x we can convert the prior probability P(ωj) to the a posteriori (or posterior) probability P(ωj|x) — the probability of the state of nature being ωj given that feature value x has been measured. We call p(x|ωj) the likelihood of ωj with respect to x, a term chosen to indicate that, other things being equal, the category ωj for which p(x|ωj) is large is more "likely" to be the true category. Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one, as all good probabilities must. The variation of P(ωj|x) with x is illustrated in Fig. 2.2 for the case P(ω1) = 2/3 and P(ω2) = 1/3.

Figure 2.1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.

If we have an observation x for which P(ω1|x) is greater than P(ω2|x), we would naturally be inclined to decide that the true state of nature is ω1. Similarly, if P(ω2|x) is greater than P(ω1|x), we would be inclined to choose ω2. To justify this decision procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

    P(error|x) = P(ω1|x)  if we decide ω2
                 P(ω2|x)  if we decide ω1.                           (4)

Clearly, for a given x we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise. Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

    P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx,   (5)

and if for every x we ensure that P(error|x) is as small as possible, then the integral must be as small as possible. Thus we have justified the following Bayes' decision rule for minimizing the probability of error:

    Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2,             (6)

and under this rule Eq. 4 becomes

    P(error|x) = min[P(ω1|x), P(ω2|x)].                              (7)

Figure 2.2: Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.

This form of the decision rule emphasizes the role of the posterior probabilities. By using Eq. 1, we can instead express the rule in terms of the conditional and prior probabilities. First note that the evidence, p(x), in Eq. 1 is unimportant as far as making a decision is concerned. It is basically just a scale factor that states how frequently we will actually measure a pattern with feature value x; its presence in Eq. 1 assures us that P(ω1|x) + P(ω2|x) = 1. By eliminating this scale factor, we obtain the following completely equivalent decision rule:

    Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2.   (8)

Some additional insight can be obtained by considering a few special cases. If for some x we have p(x|ω1) = p(x|ω2), then that particular observation gives us no information about the state of nature; in this case, the decision hinges entirely on the prior probabilities. On the other hand, if P(ω1) = P(ω2), then the states of nature are equally probable; in this case the decision is based entirely on the likelihoods p(x|ωj). In general, both of these factors are important in making a decision, and the Bayes decision rule combines them to achieve the minimum probability of error.
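A short sketch of the minimum-error-rate rule (Eqs. 6 and 8) and the average error of Eq. 5, reusing the hypothetical densities and priors assumed in the previous sketch; the integration range and trapezoidal grid are likewise arbitrary choices for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Same illustrative densities and priors as before (parameters invented).
p1 = lambda x: gaussian_pdf(x, 11.0, 1.0)   # p(x|w1)
p2 = lambda x: gaussian_pdf(x, 13.0, 1.5)   # p(x|w2)
P1, P2 = 2.0 / 3.0, 1.0 / 3.0               # priors

def decide(x):
    """Eq. 8: decide w1 iff p(x|w1) P(w1) > p(x|w2) P(w2)."""
    return 1 if p1(x) * P1 > p2(x) * P2 else 2

def bayes_error(lo=0.0, hi=25.0, n=20001):
    """Approximate Eq. 5, with P(error|x) p(x) = min joint density (Eq. 7)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        err_joint = min(p1(x) * P1, p2(x) * P2)   # P(error|x) p(x)
        w = 0.5 if i in (0, n - 1) else 1.0       # trapezoidal end weights
        total += w * err_joint * h
    return total

print(decide(12.0), bayes_error())
```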
2.2 Bayesian Decision Theory – Continuous Features

We shall now formalize the ideas just considered, and generalize them in four ways:

• by allowing the use of more than one feature;
• by allowing more than two states of nature;
• by allowing actions other than merely deciding the state of nature;
• by introducing a loss function more general than the probability of error.

These generalizations and their attendant notational complexities should not obscure the central points illustrated in our simple example. Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where x is in a d-dimensional Euclidean space Rd, called the feature space. Allowing more than two states of nature provides us with a useful generalization for a small notational expense. Allowing actions other than classification primarily allows the possibility of rejection, i.e., of refusing to make a decision in close cases; this is a useful option if being indecisive is not too costly. Formally, the loss function states exactly how costly each action is, and is used to convert a probability determination into a decision. Cost functions let us treat situations in which some kinds of classification mistakes are more costly than others, although we often discuss the simplest case, where all errors are equally costly. With this as a preamble, let us begin the more formal treatment.

Let ω1, ..., ωc be the finite set of c states of nature ("categories") and α1, ..., αa be the finite set of a possible actions. The loss function λ(αi|ωj) describes the loss incurred for taking action αi when the state of nature is ωj. Let the feature vector x be a d-component vector-valued random variable, and let p(x|ωj) be the state-conditional probability density function for x — the probability density function for x conditioned on ωj being the true state of nature. As before, P(ωj) describes the prior probability that nature is in state ωj. Then the posterior probability P(ωj|x) can be computed from p(x|ωj) by Bayes' formula:

    P(ωj|x) = p(x|ωj) P(ωj) / p(x),                                  (9)

where the evidence is now

    p(x) = Σ_{j=1}^{c} p(x|ωj) P(ωj).                                (10)

Suppose that we observe a particular x and that we contemplate taking action αi. If the true state of nature is ωj, by definition we will incur the loss λ(αi|ωj). Since P(ωj|x) is the probability that the true state of nature is ωj, the expected loss associated with taking action αi is merely

    R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x).                          (11)

In decision-theoretic terminology, an expected loss is called a risk, and R(αi|x) is called the conditional risk.
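As a small illustration of Eq. 11, the sketch below computes conditional risks from a loss matrix and a posterior vector, then picks the risk-minimizing action. The three-category loss values and the posteriors are made up for the example; the text specifies only the general form λ(αi|ωj).

```python
# Conditional risk R(a_i|x) = sum_j loss[i][j] * posterior[j]   (Eq. 11)
# loss[i][j] is a hypothetical loss for taking action a_i when truth is w_j.
loss = [
    [0.0, 2.0, 3.0],   # action a1: decide w1
    [1.0, 0.0, 3.0],   # action a2: decide w2
    [1.5, 2.0, 0.0],   # action a3: decide w3
]
posterior = [0.5, 0.3, 0.2]   # P(w_j|x) for some observed x (illustrative)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
best = min(range(len(risks)), key=lambda i: risks[i])
print(risks, "-> take action", best + 1)
```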
Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk. We shall now show that this Bayes decision procedure actually provides the optimal performance on an overall risk.

Stated formally, our problem is to find a decision rule against P(ωj) that minimizes the overall risk. A general decision rule is a function α(x) that tells us which action to take for every possible observation. To be more specific, for every x the decision function α(x) assumes one of the a values α1, ..., αa. The overall risk R is the expected loss associated with a given decision rule. Since R(αi|x) is the conditional risk associated with action αi, and since the decision rule specifies the action, the overall risk is given by

    R = ∫ R(α(x)|x) p(x) dx,                                         (12)

where dx is our notation for a d-space volume element, and where the integral extends over the entire feature space. Clearly, if α(x) is chosen so that R(α(x)|x) is as small as possible for every x, then the overall risk will be minimized. This justifies the following statement of the Bayes decision rule: To minimize the overall risk, compute the conditional risk

    R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)                           (13)

for i = 1, ..., a and select the action αi for which R(αi|x) is minimum.∗ The resulting minimum overall risk is called the Bayes risk, denoted R∗, and is the best performance that can be achieved.

∗ Note that if more than one action minimizes R(α|x), it does not matter which of these actions is taken, and any convenient tie-breaking rule can be used.

2.2.1 Two-Category Classification

Let us consider these results when applied to the special case of two-category classification problems. Here action α1 corresponds to deciding that the true state of nature is ω1, and action α2 corresponds to deciding that it is ω2. For notational simplicity, let λij = λ(αi|ωj) be the loss incurred for deciding ωi when the true state of nature is ωj. If we write out the conditional risk given by Eq. 13, we obtain

    R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
and
    R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x).                             (14)

There are a variety of ways of expressing the minimum-risk decision rule, each having its own minor advantages. The fundamental rule is to decide ω1 if R(α1|x) < R(α2|x). In terms of the posterior probabilities, we decide ω1 if

    (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x).                       (15)

Ordinarily, the loss incurred for making an error is greater than the loss incurred for being correct, and both of the factors λ21 − λ11 and λ12 − λ22 are positive. Thus in practice, our decision is generally determined by the more likely state of nature, although we must scale the posterior probabilities by the loss differences. By employing Bayes' formula, we can replace the posterior probabilities by the prior probabilities and the conditional densities. This results in the equivalent rule, to decide ω1 if

    (λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2),           (16)

and ω2 otherwise. Another alternative, which follows at once under the reasonable assumption that λ21 > λ11, is to decide ω1 if

    p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)].   (17)

This form of the decision rule focuses on the x-dependence of the probability densities. We can consider p(x|ωj) a function of ωj (i.e., the likelihood function), and then form the likelihood ratio p(x|ω1)/p(x|ω2).
Thus the Bayes decision rule can be interpreted as calling for deciding ω1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.

2.3 Minimum-Error-Rate Classification

In classification problems, each state of nature is usually associated with a different one of the c classes, and the action αi is usually interpreted as the decision that the true state of nature is ωi. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j, and in error if i ≠ j. If errors are to be avoided, it is natural to seek a decision rule that minimizes the probability of error, i.e., the error rate. The loss function of interest for this case is hence the so-called symmetrical or zero-one loss function,

    λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j,    for i, j = 1, ..., c.   (18)

This loss function assigns no loss to a correct decision, and assigns a unit loss to any error; thus, all errors are equally costly.∗ The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is

    R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)
            = Σ_{j≠i} P(ωj|x)
            = 1 − P(ωi|x),                                           (19)

and P(ωi|x) is the conditional probability that action αi is correct. The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability P(ωi|x). In other words, for minimum error rate:

    Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.                    (20)

This is the same rule as in Eq. 6.

We saw in Fig. 2.2 some class-conditional probability densities and the posterior probabilities; Fig. 2.3 shows the likelihood ratio p(x|ω1)/p(x|ω2) for the same case. In general, this ratio can range between zero and infinity. The threshold value θa marked is from the same prior probabilities but with the zero-one loss function. Notice that this leads to the same decision boundaries as in Fig. 2.2, as it must. If we penalize mistakes in classifying ω2 patterns as ω1 more than the converse (i.e., λ12 > λ21), then Eq. 17 leads to the larger threshold θb marked. Note that the range of x values for which we classify a pattern as ω1 gets smaller, as it should.

∗ We note that other loss functions, such as quadratic and linear-difference, find greater use in regression tasks, where there is a natural ordering on the predictions and we can meaningfully penalize predictions that are "more wrong" than others.

Figure 2.3: The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse (i.e., λ12 > λ21), we get the larger threshold θb, and hence R1 becomes smaller.
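The following sketch illustrates the threshold behavior of Eq. 17: with a zero-one loss the decision reduces to comparing the likelihood ratio against P(ω2)/P(ω1) (the threshold θa), and raising λ12 raises the threshold (θb), shrinking the region R1. The densities and the loss values are again invented for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p1 = lambda x: gaussian_pdf(x, 11.0, 1.0)   # p(x|w1), illustrative
p2 = lambda x: gaussian_pdf(x, 13.0, 1.5)   # p(x|w2), illustrative
P1, P2 = 2.0 / 3.0, 1.0 / 3.0

def decide(x, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Eq. 17: decide w1 iff the likelihood ratio exceeds the loss-scaled threshold."""
    threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
    return 1 if p1(x) / p2(x) > threshold else 2

x = 12.0
print(decide(x))            # zero-one loss: threshold theta_a = P2/P1
print(decide(x, l12=5.0))   # heavier penalty for deciding w1 when truth is w2:
                            # larger threshold theta_b, so region R1 shrinks
```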
2.3.1 *Minimax Criterion

Sometimes we must design our classifier to perform well over a range of prior probabilities. For instance, in our fish categorization problem we can imagine that whereas the physical properties of lightness and width of each type of fish remain constant, the prior probabilities might vary widely and in an unpredictable way, or alternatively we want to use the classifier in a different plant where we do not know the prior probabilities. A reasonable approach is then to design our classifier so that the worst overall risk for any value of the priors is as small as possible — that is, minimize the maximum possible overall risk.

In order to understand this, we let R1 denote that (as yet unknown) region in feature space where the classifier decides ω1 and likewise for R2 and ω2, and then write our overall risk Eq. 12 in terms of conditional risks:

    R = ∫_{R1} [λ11 P(ω1) p(x|ω1) + λ12 P(ω2) p(x|ω2)] dx
      + ∫_{R2} [λ21 P(ω1) p(x|ω1) + λ22 P(ω2) p(x|ω2)] dx.           (21)

We use the fact that P(ω2) = 1 − P(ω1) and that ∫_{R1} p(x|ω1) dx = 1 − ∫_{R2} p(x|ω1) dx to rewrite the risk as:

    R(P(ω1)) = λ22 + (λ12 − λ22) ∫_{R1} p(x|ω2) dx
             + P(ω1) [ (λ11 − λ22) + (λ21 − λ11) ∫_{R2} p(x|ω1) dx − (λ12 − λ22) ∫_{R1} p(x|ω2) dx ].   (22)

Here the first two terms give the minimax risk Rmm, and the bracketed coefficient of P(ω1) must equal 0 for the minimax solution. This equation shows that once the decision boundary is set (i.e., R1 and R2 determined), the overall risk is linear in P(ω1). If we can find a boundary such that the constant of proportionality is 0, then the risk is independent of priors. This is the minimax solution, and the minimax risk, Rmm, can be read from Eq. 22:

    Rmm = λ22 + (λ12 − λ22) ∫_{R1} p(x|ω2) dx
        = λ11 + (λ21 − λ11) ∫_{R2} p(x|ω1) dx.                       (23)

Figure 2.4 illustrates the approach. Briefly stated, we search for the prior for which the Bayes risk is maximum; the corresponding decision boundary gives the minimax solution. The value of the minimax risk, Rmm, is hence equal to the worst Bayes risk. In practice, finding the decision boundary for minimax risk may be difficult, particularly when distributions are complicated. Nevertheless, in some cases the boundary can be determined analytically (Problem 3).

The minimax criterion finds greater use in game theory than it does in traditional pattern recognition. In game theory, you have a hostile opponent who can be expected to take an action maximally detrimental to you. Thus it makes great sense for you to take an action (e.g., make a classification) where your costs — due to your opponent's subsequent actions — are minimized.

Figure 2.4: The curve at the bottom shows the minimum (Bayes) error as a function of prior probability P(ω1) in a two-category classification problem of fixed distributions. For each value of the priors (e.g., P(ω1) = 0.25) there is a corresponding optimal decision boundary and associated Bayes error rate. For any (fixed) such boundary, if the priors are then changed, the probability of error will change as a linear function of P(ω1) (shown by the dashed line). The maximum such error will occur at an extreme value of the prior, here at P(ω1) = 1. To minimize the maximum of such error, we should design our decision boundary for the maximum Bayes error (here P(ω1) = 0.6), and thus the error will not change as a function of prior, as shown by the solid red horizontal line.
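A numerical sketch of the minimax search just described: sweep the prior, compute the zero-one Bayes error for each value, and take the prior where it peaks (the peak of the lower curve in a plot analogous to Fig. 2.4). At that boundary the two conditional error integrals of Eq. 23 agree, so the risk no longer depends on the true prior. The Gaussian densities and the integration grid are assumptions of this sketch.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p1 = lambda x: gaussian_pdf(x, 11.0, 1.0)   # p(x|w1), illustrative
p2 = lambda x: gaussian_pdf(x, 13.0, 1.5)   # p(x|w2), illustrative

GRID = [i * 0.005 for i in range(5001)]     # x in [0, 25]
DX = 0.005

def bayes_error(prior1):
    """Zero-one Bayes error for a given prior: integral of the min joint density."""
    return sum(min(prior1 * p1(x), (1 - prior1) * p2(x)) for x in GRID) * DX

# Search for the prior with the largest Bayes error; its decision regions
# define the minimax boundary.
priors = [i / 100 for i in range(1, 100)]
worst = max(priors, key=bayes_error)

# At the minimax boundary the class-conditional error masses coincide (Eq. 23),
# so the overall error is the same whatever the true prior turns out to be.
eps1 = sum(p1(x) for x in GRID if worst * p1(x) <= (1 - worst) * p2(x)) * DX
eps2 = sum(p2(x) for x in GRID if worst * p1(x) > (1 - worst) * p2(x)) * DX
print(worst, bayes_error(worst), eps1, eps2)   # eps1 and eps2 approximately equal
```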
2.3.2 *Neyman-Pearson Criterion

In some problems, we may wish to minimize the overall risk subject to a constraint; for instance, we might wish to minimize the total risk subject to the constraint ∫ R(αi|x) dx < constant for some particular i. Such a constraint might arise when there is a fixed resource that accompanies one particular action αi, or when we must not misclassify patterns from a particular state of nature ωi at more than some limited frequency. For instance, in our fish example, there might be some government regulation that we must not misclassify more than 1% of salmon as sea bass. We might then seek a decision that minimizes the chance of classifying a sea bass as a salmon subject to this condition.
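To illustrate such a constrained rule, the sketch below picks the smallest likelihood-ratio threshold whose salmon-misclassification rate stays within a 1% cap; among threshold rules, lowering the threshold enlarges R1 and so this choice minimizes the sea-bass error. The Gaussian densities and the integration grid are again invented for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p1 = lambda x: gaussian_pdf(x, 11.0, 1.0)   # sea bass lightness density (invented)
p2 = lambda x: gaussian_pdf(x, 13.0, 1.5)   # salmon lightness density (invented)

GRID = [i * 0.005 for i in range(5001)]     # x in [0, 25]
DX = 0.005

def errors(t):
    """Decide w1 (sea bass) wherever the likelihood ratio exceeds t; return
    (salmon misclassified as sea bass, sea bass misclassified as salmon)."""
    in_r1 = [p1(x) / p2(x) > t for x in GRID]
    e_salmon = sum(p2(x) for x, r in zip(GRID, in_r1) if r) * DX
    e_bass = sum(p1(x) for x, r in zip(GRID, in_r1) if not r) * DX
    return e_salmon, e_bass

# Sweep thresholds upward and stop at the first (smallest) one that meets the
# 1% cap on salmon misclassification; raising t further only hurts sea bass.
for t in [0.1 * k for k in range(1, 200)]:
    e_salmon, e_bass = errors(t)
    if e_salmon <= 0.01:
        print(t, e_salmon, e_bass)
        break
```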