Computer Vision: Algorithms and Applications

Summary

This document provides an introduction to computer vision, discussing its fundamentals and various applications. It explores how computer vision algorithms attempt to understand and reconstruct the world from images, highlighting the challenges involved. The document also details the diverse real-world applications of computer vision in fields ranging from industrial inspection to self-driving vehicles.


1.1 What is computer vision?

As humans, we perceive the three-dimensional structure of the world around us with apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers sitting on the table next to you. You can tell the shape and translucency of each petal through the subtle patterns of light and shading that play across its surface and effortlessly segment each flower from the background of the scene (Figure 1.1). Looking at a framed group portrait, you can easily count and name all of the people in the picture and even guess at their emotions from their facial expressions (Figure 1.2a). Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions1 to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains elusive (Marr 1982; Wandell 1995; Palmer 1999; Livingstone 2008; Frisby and Stone 2010).

Researchers in computer vision have been developing, in parallel, mathematical techniques for recovering the three-dimensional shape and appearance of objects in imagery. Here, the progress in the last two decades has been rapid. We now have reliable techniques for accurately computing a 3D model of an environment from thousands of partially overlapping photographs (Figure 1.2c). Given a large enough set of views of a particular object or façade, we can create accurate dense 3D surface models using stereo matching (Figure 1.2d). We can even, with moderate success, delineate most of the people and objects in a photograph (Figure 1.2a). However, despite all of these advances, the dream of having a computer explain an image at the same level of detail and causality as a two-year-old remains elusive.

Why is vision so difficult? In part, it is because it is an inverse problem, in which we seek to recover some unknowns given insufficient information to fully specify the solution. We must therefore resort to physics-based and probabilistic models, or machine learning from large sets of examples, to disambiguate between potential solutions. However, modeling the visual world in all of its rich complexity is far more difficult than, say, modeling the vocal tract that produces spoken sounds.

The forward models that we use in computer vision are usually developed in physics (radiometry, optics, and sensor design) and in computer graphics. Both of these fields model how objects move and animate, how light reflects off their surfaces, is scattered by the atmosphere, refracted through camera lenses (or human eyes), and finally projected onto a flat (or curved) image plane. While computer graphics are not yet perfect, in many domains, such as rendering a still scene composed of everyday objects or animating extinct creatures such as dinosaurs, the illusion of reality is essentially there.

1 Some fun pages with striking illusions include https://michaelbach.de/ot, https://www.illusionsindex.org, and http://www.ritsumei.ac.jp/~akitaoka/index-e.html.
Figure 1.3 Some common optical illusions and what they might tell us about the visual system: (a) The classic Müller-Lyer illusion, where the lengths of the two horizontal lines appear different, probably due to the imagined perspective effects. (b) The "white" square B in the shadow and the "black" square A in the light actually have the same absolute intensity value. The percept is due to brightness constancy, the visual system's attempt to discount illumination when interpreting colors. Image courtesy of Ted Adelson, http://persci.mit.edu/gallery/checkershadow. (c) A variation of the Hermann grid illusion, courtesy of Hany Farid. As you move your eyes over the figure, gray spots appear at the intersections. (d) Count the red Xs in the left half of the figure. Now count them in the right half. Is it significantly harder? The explanation has to do with a pop-out effect (Treisman 1985), which tells us about the operations of parallel perception and integration pathways in the brain.

In computer vision, we are trying to do the inverse, i.e., to describe the world that we see in one or more images and to reconstruct its properties, such as shape, illumination, and color distributions. It is amazing that humans and animals do this so effortlessly, while computer vision algorithms are so error prone. People who have not worked in the field often underestimate the difficulty of the problem. This misperception that vision should be easy dates back to the early days of artificial intelligence (see Section 1.2), when it was initially believed that the cognitive (logic proving and planning) parts of intelligence were intrinsically more difficult than the perceptual components (Boden 2006).

The good news is that computer vision is being used today in a wide variety of real-world applications, which include:

Optical character recognition (OCR): reading handwritten postal codes on letters (Figure 1.4a) and automatic number plate recognition (ANPR);

Machine inspection: rapid parts inspection for quality assurance using stereo vision with specialized illumination to measure tolerances on aircraft wings or auto body parts (Figure 1.4b) or looking for defects in steel castings using X-ray vision;

Retail: object recognition for automated checkout lanes and fully automated stores (Wingfield 2019);

Warehouse logistics: autonomous package delivery and pallet-carrying "drives" (Guizzo 2008; O'Brian 2019) and parts picking by robotic manipulators (Figure 1.4c; Ackerman 2020);

Medical imaging: registering pre-operative and intra-operative imagery (Figure 1.4d) or performing long-term studies of people's brain morphology as they age;

Self-driving vehicles: capable of driving point-to-point between cities (Figure 1.4e; Montemerlo, Becker et al. 2008; Urmson, Anhalt et al. 2008; Janai, Güney et al. 2020) as well as autonomous flight (Kaufmann, Gehrig et al.
2019); 3D model building (photogrammetry): fully automated construction of 3D models from aerial and drone photographs (Figure 1.4f); Match move: merging computer-generated imagery (CGI) with live action footage by tracking feature points in the source video to estimate the 3D camera motion and shape of the environment. Such techniques are widely used in Hollywood, e.g., in movies such as Jurassic Park (Roble 1999; Roble and Zafar 2009); they also require the use of 6 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.4 Some industrial applications of computer vision: (a) optical char- acter recognition (OCR), http:// yann.lecun.com/ exdb/ lenet; (b) mechanical inspection, http:// www.cognitens.com; (c) warehouse picking, https:// covariant.ai; (d) medical imaging, http:// www.clarontech.com; (e) self-driving cars, (Montemerlo, Becker et al. 2008) © 2008 Wiley; (f) drone-based photogrammetry, https:// www.pix4d.com/ blog/ mapping-chillon-castle-with-drone. 1.1 What is computer vision? 7 precise matting to insert new elements between foreground and background elements (Chuang, Agarwala et al. 2002). Motion capture (mocap): using retro-reflective markers viewed from multiple cam- eras or other vision-based techniques to capture actors for computer animation; Surveillance: monitoring for intruders, analyzing highway traffic and monitoring pools for drowning victims (e.g., https://swimeye.com); Fingerprint recognition and biometrics: for automatic access authentication as well as forensic applications. David Lowe’s website of industrial vision applications (http://www.cs.ubc.ca/spider/lowe/ vision.html) lists many other interesting industrial applications of computer vision. While the above applications are all extremely important, they mostly pertain to fairly specialized kinds of imagery and narrow domains. In addition to all of these industrial applications, there exist myriad consumer-level ap- plications, such as things you can do with your own personal photographs and video. These include: Stitching: turning overlapping photos into a single seamlessly stitched panorama (Fig- ure 1.5a), as described in Section 8.2; Exposure bracketing: merging multiple exposures taken under challenging lighting conditions (strong sunlight and shadows) into a single perfectly exposed image (Fig- ure 1.5b), as described in Section 10.2; Morphing: turning a picture of one of your friends into another, using a seamless morph transition (Figure 1.5c); 3D modeling: converting one or more snapshots into a 3D model of the object or person you are photographing (Figure 1.5d), as described in Section 13.6; Video match move and stabilization: inserting 2D pictures or 3D models into your videos by automatically tracking nearby reference points (see Section 11.4.4)2 or using motion estimates to remove shake from your videos (see Section 9.2.1); Photo-based walkthroughs: navigating a large collection of photographs, such as the interior of your house, by flying between different photos in 3D (see Sections 14.1.2 and 14.5.5); 2 For a fun student project on this topic, see the “PhotoBook” project at http://www.cc.gatech.edu/dvfx/videos/ dvfx2005.html. 8 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 
2021) Face detection: for improved camera focusing as well as more relevant image search- ing (see Section 6.3.1); Visual authentication: automatically logging family members onto your home com- puter as they sit down in front of the webcam (see Section 6.2.4). The great thing about these applications is that they are already familiar to most students; they are, at least, technologies that students can immediately appreciate and use with their own personal media. Since computer vision is a challenging topic, given the wide range of mathematics being covered3 and the intrinsically difficult nature of the problems being solved, having fun and relevant problems to work on can be highly motivating and inspiring. The other major reason why this book has a strong focus on applications is that they can be used to formulate and constrain the potentially open-ended problems endemic in vision. Thus, it is better to think back from the problem at hand to suitable techniques, rather than to grab the first technique that you may have heard of. This kind of working back from problems to solutions is typical of an engineering approach to the study of vision and reflects my own background in the field. First, I come up with a detailed problem definition and decide on the constraints and specifications for the problem. Then, I try to find out which techniques are known to work, implement a few of these, evaluate their performance, and finally make a selection. In order for this process to work, it is important to have realistic test data, both synthetic, which can be used to verify correctness and analyze noise sensitivity, and real-world data typical of the way the system will finally be used. If machine learning is being used, it is even more important to have representative unbiased training data in sufficient quantity to obtain good results on real-world inputs. However, this book is not just an engineering text (a source of recipes). It also takes a scientific approach to basic vision problems. Here, I try to come up with the best possible models of the physics of the system at hand: how the scene is created, how light interacts with the scene and atmospheric effects, and how the sensors work, including sources of noise and uncertainty. The task is then to try to invert the acquisition process to come up with the best possible description of the scene. The book often uses a statistical approach to formulating and solving computer vision problems. Where appropriate, probability distributions are used to model the scene and the noisy image acquisition process. The association of prior distributions with unknowns is often called Bayesian modeling (Appendix B). It is possible to associate a risk or loss function with 3 These techniques include physics, Euclidean and projective geometry, statistics, and optimization. They make computer vision a fascinating field to study and a great way to learn techniques widely applicable in other fields. 1.1 What is computer vision? 9 (a) (b) (c) (d) Figure 1.5 Some consumer applications of computer vision: (a) image stitching: merging different views (Szeliski and Shum 1997) © 1997 ACM; (b) exposure bracketing: merging different exposures; (c) morphing: blending between two photographs (Gomes, Darsa et al. 1999) © 1999 Morgan Kaufmann; (d) smartphone augmented reality showing real-time depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM. 10 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 
2021) misestimating the answer (Section B.2) and to set up your inference algorithm to minimize the expected risk. (Consider a robot trying to estimate the distance to an obstacle: it is usually safer to underestimate than to overestimate.) With statistical techniques, it often helps to gather lots of training data from which to learn probabilistic models. Finally, statistical approaches enable you to use proven inference techniques to estimate the best answer (or distribution of answers) and to quantify the uncertainty in the resulting estimates. Because so much of computer vision involves the solution of inverse problems or the esti- mation of unknown quantities, my book also has a heavy emphasis on algorithms, especially those that are known to work well in practice. For many vision problems, it is all too easy to come up with a mathematical description of the problem that either does not match realistic real-world conditions or does not lend itself to the stable estimation of the unknowns. What we need are algorithms that are both robust to noise and deviation from our models and rea- sonably efficient in terms of run-time resources and space. In this book, I go into these issues in detail, using Bayesian techniques, where applicable, to ensure robustness, and efficient search, minimization, and linear system solving algorithms to ensure efficiency.4 Most of the algorithms described in this book are at a high level, being mostly a list of steps that have to be filled in by students or by reading more detailed descriptions elsewhere. In fact, many of the algorithms are sketched out in the exercises. Now that I’ve described the goals of this book and the frameworks that I use, I devote the rest of this chapter to two additional topics. Section 1.2 is a brief synopsis of the history of computer vision. It can easily be skipped by those who want to get to “the meat” of the new material in this book and do not care as much about who invented what when. The second is an overview of the book’s contents, Section 1.3, which is useful reading for everyone who intends to make a study of this topic (or to jump in partway, since it describes chapter interdependencies). This outline is also useful for instructors looking to structure one or more courses around this topic, as it provides sample curricula based on the book’s contents. 1.2 A brief history In this section, I provide a brief personal synopsis of the main developments in computer vi- sion over the last fifty years (Figure 1.6) with a focus on advances I find personally interesting and that have stood the test of time. Readers not interested in the provenance of various ideas and the evolution of this field should skip ahead to the book overview in Section 1.3. 4 In some cases, deep neural networks have also been shown to be an effective way to speed up algorithms that previously relied on iteration (Chen, Xu, and Koltun 2017). 
1.2 A brief history 11 1970 1980 1990 2000 2010 2020 Blocks world, line labeling Generalized cylinders Pattern recognition Stereo correspondence Optical flow Physically-based modeling Regularization Markov random fields Kalman filters Projective invariants Factorization Physics-based vision Graph cuts Particle filtering Energy-based segmentation Computational photography Feature-based recognition Machine learning Semantic segmentation SLAM and VIO Deep learning Vision and language Digital image processing Intrinsic images Structure from motion Image pyramids Shape from shading, texture, and focus 3D range data processing Face recognition and detection Texture synthesis and inpainting Image-based modeling and rendering Modeling and tracking humans Category recognition Figure 1.6 A rough timeline of some of the most active topics of research in computer vision. 1970s. When computer vision first started out in the early 1970s, it was viewed as the visual perception component of an ambitious agenda to mimic human intelligence and to endow robots with intelligent behavior. At the time, it was believed by some of the early pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) that solving the “visual input” problem would be an easy step along the path to solving more difficult problems such as higher-level reasoning and planning. According to one well-known story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to “spend the summer linking a camera to a computer and getting the computer to describe what it saw” (Boden 2006, p. 781).5 We now know that the problem is slightly more difficult than that.6 What distinguished computer vision from the already existing field of digital image pro- cessing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the three-dimensional structure of the world from images and to use this as a stepping stone to- wards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide two nice collections of classic papers from this early period. Early attempts at scene understanding involved extracting edges and then inferring the 5 Boden (2006) cites (Crevier 1993) as the original source. The actual Vision Memo was authored by Seymour Papert (1966) and involved a whole cohort of students. 6 To see how far robotic vision has come in the last six decades, have a look at some of the videos on the Boston Dynamics https://www.bostondynamics.com, Skydio https://www.skydio.com, and Covariant https://covariant.ai websites. 12 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.7 Some early (1970s) examples of computer vision algorithms: (a) line labeling (Nalwa 1993) © 1993 Addison-Wesley, (b) pictorial structures (Fischler and Elschlager 1973) © 1973 IEEE, (c) articulated body model (Marr 1982) © 1982 David Marr, (d) intrinsic images (Barrow and Tenenbaum 1981) © 1973 IEEE, (e) stereo correspondence (Marr 1982) © 1982 David Marr, (f) optical flow (Nagel and Enkelmann 1986) © 1986 IEEE. 3D structure of an object or a “blocks world” from the topological structure of the 2D lines (Roberts 1965). Several line labeling algorithms (Figure 1.7a) were developed at that time (Huffman 1971; Clowes 1971; Waltz 1975; Rosenfeld, Hummel, and Zucker 1976; Kanade 1980). Nalwa (1993) gives a nice review of this area. 
The topic of edge detection was also an active area of research; a nice survey of contemporaneous work can be found in (Davis 1975). Three-dimensional modeling of non-polyhedral objects was also being studied (Baum- gart 1974; Baker 1977). One popular approach used generalized cylinders, i.e., solids of revolution and swept closed curves (Agin and Binford 1976; Nevatia and Binford 1977), of- ten arranged into parts relationships7 (Hinton 1977; Marr 1982) (Figure 1.7c). Fischler and Elschlager (1973) called such elastic arrangements of parts pictorial structures (Figure 1.7b). A qualitative approach to understanding intensities and shading variations and explaining them by the effects of image formation phenomena, such as surface orientation and shadows, was championed by Barrow and Tenenbaum (1981) in their paper on intrinsic images (Fig- ure 1.7d), along with the related 2 1/2 -D sketch ideas of Marr (1982). This approach has seen 7 In robotics and computer animation, these linked-part graphs are often called kinematic chains. 1.2 A brief history 13 periodic revivals, e.g., in the work of Tappen, Freeman, and Adelson (2005) and Barron and Malik (2012). More quantitative approaches to computer vision were also developed at the time, in- cluding the first of many feature-based stereo correspondence algorithms (Figure 1.7e) (Dev 1974; Marr and Poggio 1976, 1979; Barnard and Fischler 1982; Ohta and Kanade 1985; Grimson 1985; Pollard, Mayhew, and Frisby 1985) and intensity-based optical flow algo- rithms (Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 1986). The early work in simultaneously recovering 3D structure and camera motion (see Chapter 11) also began around this time (Ullman 1979; Longuet-Higgins 1981). A lot of the philosophy of how vision was believed to work at the time is summarized in David Marr’s (1982) book.8 In particular, Marr introduced his notion of the three levels of description of a (visual) information processing system. These three levels, very loosely paraphrased according to my own interpretation, are: Computational theory: What is the goal of the computation (task) and what are the constraints that are known or can be brought to bear on the problem? Representations and algorithms: How are the input, output, and intermediate infor- mation represented and which algorithms are used to calculate the desired result? Hardware implementation: How are the representations and algorithms mapped onto actual hardware, e.g., a biological vision system or a specialized piece of silicon? Con- versely, how can hardware constraints be used to guide the choice of representation and algorithm? With the prevalent use of graphics chips (GPUs) and many-core architec- tures for computer vision, this question is again quite relevant. As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the problem specification and known constraints from image formation and priors (the scientific and statistical approaches) must be married with efficient and robust algorithms (the engineer- ing approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy is as good a guide to framing and solving problems in our field today as it was 25 years ago. 1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical techniques for performing quantitative image and scene analysis. 
Image pyramids (see Section 3.5) started being widely used to perform tasks such as im- age blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt 8 More recent developments in visual perception theory are covered in (Wandell 1995; Palmer 1999; Livingstone 2008; Frisby and Stone 2010). 14 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.8 Examples of computer vision algorithms from the 1980s: (a) pyramid blending (Burt and Adelson 1983b) © 1983 ACM, (b) shape from shading (Freeman and Adelson 1991) © 1991 IEEE, (c) edge detection (Freeman and Adelson 1991) © 1991 IEEE, (d) physically based models (Terzopoulos and Witkin 1988) © 1988 IEEE, (e) regularization-based surface reconstruction (Terzopoulos 1988) © 1988 IEEE, (f) range data acquisition and merging (Banno, Masuda et al. 2008) © 2008 Springer. and Adelson 1983b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions of pyramids using the concept of scale-space processing were also developed (Witkin 1983; Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see Sec- tion 3.5.4) started displacing or augmenting regular image pyramids in some applications (Mallat 1989; Simoncelli and Adelson 1990a; Simoncelli, Freeman et al. 1992). The use of stereo as a quantitative shape cue was extended by a wide variety of shape- from-X techniques, including shape from shading (Figure 1.8b) (see Section 13.1.1 and Horn 1975; Pentland 1984; Blake, Zisserman, and Knowles 1985; Horn and Brooks 1986, 1989), photometric stereo (see Section 13.1.1 and Woodham 1981), shape from texture (see Sec- tion 13.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from focus (see Section 13.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice discussion of most of these techniques. Research into better edge and contour detection (Figure 1.8c) (see Section 7.2) was also active during this period (Canny 1986; Nalwa and Binford 1986), including the introduc- tion of dynamically evolving contour trackers (Section 7.3.1) such as snakes (Kass, Witkin, 1.2 A brief history 15 and Terzopoulos 1988), as well as three-dimensional physically based models (Figure 1.8d) (Terzopoulos, Witkin, and Kass 1987; Kass, Witkin, and Terzopoulos 1988; Terzopoulos and Fleischer 1988). Researchers noticed that a lot of the stereo, flow, shape-from-X, and edge detection al- gorithms could be unified, or at least described, using the same mathematical framework if they were posed as variational optimization problems and made more robust (well-posed) using regularization (Figure 1.8e) (see Section 4.2 and Terzopoulos 1983; Poggio, Torre, and Koch 1985; Terzopoulos 1986b; Blake and Zisserman 1987; Bertero, Poggio, and Torre 1988; Terzopoulos 1988). Around the same time, Geman and Geman (1984) pointed out that such problems could equally well be formulated using discrete Markov random field (MRF) models (see Section 4.3), which enabled the use of better (global) search and optimization algorithms, such as simulated annealing. Online variants of MRF algorithms that modeled and updated uncertainties using the Kalman filter were introduced a little later (Dickmanns and Graefe 1988; Matthies, Kanade, and Szeliski 1989; Szeliski 1989). Attempts were also made to map both regularized and MRF algorithms onto parallel hardware (Poggio and Koch 1985; Poggio, Little et al. 1988; Fischler, Firschein et al. 1989). 
The book by Fischler and Firschein (1987) contains a nice collection of articles focusing on all of these topics (stereo, flow, regularization, MRFs, and even higher-level vision). Three-dimensional range data processing (acquisition, merging, modeling, and recogni- tion; see Figure 1.8f) continued being actively explored during this decade (Agin and Binford 1976; Besl and Jain 1985; Faugeras and Hebert 1987; Curless and Levoy 1996). The compi- lation by Kanade (1987) contains a lot of the interesting papers in this area. 1990s. While a lot of the previously mentioned topics continued to be explored, a few of them became significantly more active. A burst of activity in using projective invariants for recognition (Mundy and Zisserman 1992) evolved into a concerted effort to solve the structure from motion problem (see Chap- ter 11). A lot of the initial activity was directed at projective reconstructions, which did not require knowledge of camera calibration (Faugeras 1992; Hartley, Gupta, and Chang 1992; Hartley 1994a; Faugeras and Luong 2001; Hartley and Zisserman 2004). Simultane- ously, factorization techniques (Section 11.4.1) were developed to solve efficiently problems for which orthographic camera approximations were applicable (Figure 1.9a) (Tomasi and Kanade 1992; Poelman and Kanade 1997; Anandan and Irani 2002) and then later extended to the perspective case (Christy and Horaud 1996; Triggs 1996). Eventually, the field started using full global optimization (see Section 11.4.2 and Taylor, Kriegman, and Anandan 1991; 16 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.9 Examples of computer vision algorithms from the 1990s: (a) factorization- based structure from motion (Tomasi and Kanade 1992) © 1992 Springer, (b) dense stereo matching (Boykov, Veksler, and Zabih 2001), (c) multi-view reconstruction (Seitz and Dyer 1999) © 1999 Springer, (d) face tracking (Matthews, Xiao, and Baker 2007), (e) image seg- mentation (Belongie, Fowlkes et al. 2002) © 2002 Springer, (f) face recognition (Turk and Pentland 1991). Szeliski and Kang 1994; Azarbayejani and Pentland 1995), which was later recognized as being the same as the bundle adjustment techniques traditionally used in photogrammetry (Triggs, McLauchlan et al. 1999). Fully automated 3D modeling systems were built using such techniques (Beardsley, Torr, and Zisserman 1996; Schaffalitzky and Zisserman 2002; Snavely, Seitz, and Szeliski 2006; Agarwal, Furukawa et al. 2011; Frahm, Fite-Georgel et al. 2010). Work begun in the 1980s on using detailed measurements of color and intensity combined with accurate physical models of radiance transport and color image formation created its own subfield known as physics-based vision. A good survey of the field can be found in the three- volume collection on this topic (Wolff, Shafer, and Healey 1992a; Healey and Shafer 1992; Shafer, Healey, and Wolff 1992). Optical flow methods (see Chapter 9) continued to be improved (Nagel and Enkelmann 1986; Bolles, Baker, and Marimont 1987; Horn and Weldon Jr. 1988; Anandan 1989; Bergen, 1.2 A brief history 17 Anandan et al. 1992; Black and Anandan 1996; Bruhn, Weickert, and Schnörr 2005; Papen- berg, Bruhn et al. 2006), with (Nagel 1986; Barron, Fleet, and Beauchemin 1994; Baker, Scharstein et al. 2011) being good surveys. 
Similarly, a lot of progress was made on dense stereo correspondence algorithms (see Chapter 12, Okutomi and Kanade (1993, 1994); Boykov, Veksler, and Zabih (1998); Birchfield and Tomasi (1999); Boykov, Veksler, and Zabih (2001), and the survey and comparison in Scharstein and Szeliski (2002)), with the biggest break- through being perhaps global optimization using graph cut techniques (Figure 1.9b) (Boykov, Veksler, and Zabih 2001). Multi-view stereo algorithms (Figure 1.9c) that produce complete 3D surfaces (see Sec- tion 12.7) were also an active topic of research (Seitz and Dyer 1999; Kutulakos and Seitz 2000) that continues to be active today (Seitz, Curless et al. 2006; Schöps, Schönberger et al. 2017; Knapitsch, Park et al. 2017). Techniques for producing 3D volumetric descriptions from binary silhouettes (see Section 12.7.3) continued to be developed (Potmesil 1987; Sri- vasan, Liang, and Hackwood 1990; Szeliski 1993; Laurentini 1994), along with techniques based on tracking and reconstructing smooth occluding contours (see Section 12.2.1 and Cipolla and Blake 1992; Vaillant and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; Szeliski and Weiss 1998; Cipolla and Giblin 2000). Tracking algorithms also improved a lot, including contour tracking using active contours (see Section 7.3), such as snakes (Kass, Witkin, and Terzopoulos 1988), particle filters (Blake and Isard 1998), and level sets (Malladi, Sethian, and Vemuri 1995), as well as intensity-based (direct) techniques (Lucas and Kanade 1981; Shi and Tomasi 1994; Rehg and Kanade 1994), often applied to tracking faces (Figure 1.9d) (Lanitis, Taylor, and Cootes 1997; Matthews and Baker 2004; Matthews, Xiao, and Baker 2007) and whole bodies (Sidenbladh, Black, and Fleet 2000; Hilton, Fua, and Ronfard 2006; Moeslund, Hilton, and Krüger 2006). Image segmentation (see Section 7.5) (Figure 1.9e), a topic which has been active since the earliest days of computer vision (Brice and Fennema 1970; Horowitz and Pavlidis 1976; Riseman and Arbib 1977; Rosenfeld and Davis 1979; Haralick and Shapiro 1985; Pavlidis and Liow 1990), was also an active topic of research, producing techniques based on min- imum energy (Mumford and Shah 1989) and minimum description length (Leclerc 1989), normalized cuts (Shi and Malik 2000), and mean shift (Comaniciu and Meer 2002). Statistical learning techniques started appearing, first in the application of principal com- ponent eigenface analysis to face recognition (Figure 1.9f) (see Section 5.2.3 and Turk and Pentland 1991) and linear dynamical systems for curve tracking (see Section 7.3.1 and Blake and Isard 1998). Perhaps the most notable development in computer vision during this decade was the increased interaction with computer graphics (Seitz and Szeliski 1999), especially in the 18 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.10 Examples of computer vision algorithms from the 2000s: (a) image-based rendering (Gortler, Grzeszczuk et al. 1996), (b) image-based modeling (Debevec, Taylor, and Malik 1996) © 1996 ACM, (c) interactive tone mapping (Lischinski, Farbman et al. 2006) (d) texture synthesis (Efros and Freeman 2001), (e) feature-based recognition (Fergus, Perona, and Zisserman 2007), (f) region-based recognition (Mori, Ren et al. 2004) © 2004 IEEE. cross-disciplinary area of image-based modeling and rendering (see Chapter 14). 
The idea of manipulating real-world imagery directly to create new animations first came to prominence with image morphing techniques (Figure1.5c) (see Section 3.6.3 and Beier and Neely 1992) and was later applied to view interpolation (Chen and Williams 1993; Seitz and Dyer 1996), panoramic image stitching (Figure1.5a) (see Section 8.2 and Mann and Picard 1994; Chen 1995; Szeliski 1996; Szeliski and Shum 1997; Szeliski 2006a), and full light-field rendering (Figure 1.10a) (see Section 14.3 and Gortler, Grzeszczuk et al. 1996; Levoy and Hanrahan 1996; Shade, Gortler et al. 1998). At the same time, image-based modeling techniques (Fig- ure 1.10b) for automatically creating realistic 3D models from collections of images were also being introduced (Beardsley, Torr, and Zisserman 1996; Debevec, Taylor, and Malik 1996; Taylor, Debevec, and Malik 1996). 2000s. This decade continued to deepen the interplay between the vision and graphics fields, but more importantly embraced data-driven and learning approaches as core compo- 1.2 A brief history 19 nents of vision. Many of the topics introduced under the rubric of image-based rendering, such as image stitching (see Section 8.2), light-field capture and rendering (see Section 14.3), and high dynamic range (HDR) image capture through exposure bracketing (Figure1.5b) (see Section 10.2 and Mann and Picard 1995; Debevec and Malik 1997), were re-christened as computational photography (see Chapter 10) to acknowledge the increased use of such tech- niques in everyday digital photography. For example, the rapid adoption of exposure brack- eting to create high dynamic range images necessitated the development of tone mapping algorithms (Figure 1.10c) (see Section 10.2.1) to convert such images back to displayable results (Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark et al. 2002; Lischinski, Farbman et al. 2006). In addition to merging multiple exposures, tech- niques were developed to merge flash images with non-flash counterparts (Eisemann and Durand 2004; Petschnigg, Agrawala et al. 2004) and to interactively or automatically select different regions from overlapping images (Agarwala, Dontcheva et al. 2004). Texture synthesis (Figure 1.10d) (see Section 10.5), quilting (Efros and Leung 1999; Efros and Freeman 2001; Kwatra, Schödl et al. 2003), and inpainting (Bertalmio, Sapiro et al. 2000; Bertalmio, Vese et al. 2003; Criminisi, Pérez, and Toyama 2004) are additional topics that can be classified as computational photography techniques, since they re-combine input image samples to produce new photographs. A second notable trend during this decade was the emergence of feature-based techniques (combined with learning) for object recognition (see Section 6.1 and Ponce, Hebert et al. 2006). Some of the notable papers in this area include the constellation model of Fergus, Perona, and Zisserman (2007) (Figure 1.10e) and the pictorial structures of Felzenszwalb and Huttenlocher (2005). Feature-based techniques also dominate other recognition tasks, such as scene recognition (Zhang, Marszalek et al. 2007) and panorama and location recog- nition (Brown and Lowe 2007; Schindler, Brown, and Szeliski 2007). And while interest point (patch-based) features tend to dominate current research, some groups are pursuing recognition based on contours (Belongie, Malik, and Puzicha 2002) and region segmentation (Figure 1.10f) (Mori, Ren et al. 2004). 
Another significant trend from this decade was the development of more efficient al- gorithms for complex global optimization problems (see Chapter 4 and Appendix B.5 and Szeliski, Zabih et al. 2008; Blake, Kohli, and Rother 2011). While this trend began with work on graph cuts (Boykov, Veksler, and Zabih 2001; Kohli and Torr 2007), a lot of progress has also been made in message passing algorithms, such as loopy belief propagation (LBP) (Yedidia, Freeman, and Weiss 2001; Kumar and Torr 2006). The most notable trend from this decade, which has by now completely taken over visual recognition and most other aspects of computer vision, was the application of sophisticated 20 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.11 Examples of computer vision algorithms from the 2010s: (a) the SuperVision deep neural network © Krizhevsky, Sutskever, and Hinton (2012); (b) object instance seg- mentation (He, Gkioxari et al. 2017) © 2017 IEEE; (c) whole body, expression, and gesture fitting from a single image (Pavlakos, Choutas et al. 2019) © 2019 IEEE; (d) fusing mul- tiple color depth images using the KinectFusion real-time system (Newcombe, Izadi et al. 2011) © 2011 IEEE; (e) smartphone augmented reality with real-time depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM; (f) 3D map computed in real-time on a fully autonomous Skydio R1 drone (Cross 2019). machine learning techniques to computer vision problems (see Chapters 5 and 6). This trend coincided with the increased availability of immense quantities of partially labeled data on the internet, as well as significant increases in computational power, which makes it more feasible to learn object categories without the use of careful human supervision. 2010s. The trend towards using large labeled (and also self-supervised) datasets to develop machine learning algorithms became a tidal wave that totally revolutionized the development of image recognition algorithms as well as other applications, such as denoising and optical flow, which previously used Bayesian and global optimization techniques. This trend was enabled by the development of high-quality large-scale annotated datasets such as ImageNet (Deng, Dong et al. 2009; Russakovsky, Deng et al. 2015), Microsoft COCO (Common Objects in Context) (Lin, Maire et al. 2014), and LVIS (Gupta, Dollár, and Gir- 1.2 A brief history 21 shick 2019). These datasets provided not only reliable metrics for tracking the progress of recognition and semantic segmentation algorithms, but more importantly, sufficient labeled data to develop complete solutions based on machine learning. Another major trend was the dramatic increase in computational power available from the development of general purpose (data-parallel) algorithms on graphical processing units (GPGPU). The breakthrough SuperVision (“AlexNet”) deep neural network (Figure 1.11a; Krizhevsky, Sutskever, and Hinton 2012), which was the first neural network to win the yearly ImageNet large-scale visual recognition challenge, relied on GPU training, as well as a number of technical advances, for its dramatic performance. 
After the publication of this paper, progress in using deep convolutional architectures accelerated dramatically, to the point where they are now the only architecture considered for recognition and semantic seg- mentation tasks (Figure 1.11b), as well as the preferred architecture for many other vision tasks (Chapter 5; LeCun, Bengio, and Hinton 2015), including optical flow (Sun, Yang et al. 2018)), denoising, and monocular depth inference (Li, Dekel et al. 2019). Large datasets and GPU architectures, coupled with the rapid dissemination of ideas through timely publications on arXiv as well as the development of languages for deep learn- ing and the open sourcing of neural network models, all contributed to an explosive growth in this area, both in rapid advances and capabilities, and also in the sheer number of publica- tions and researchers now working on these topics. They also enabled the extension of image recognition approaches to video understanding tasks such as action recognition (Feichten- hofer, Fan et al. 2019), as well as structured regression tasks such as real-time multi-person body pose estimation (Cao, Simon et al. 2017). Specialized sensors and hardware for computer vision tasks also continued to advance. The Microsoft Kinect depth camera, released in 2010, quickly became an essential component of many 3D modeling (Figure 1.11d) and person tracking (Shotton, Fitzgibbon et al. 2011) systems. Over the decade, 3D body shape modeling and tracking systems continued to evolve, to the point where it is now possible to infer a person’s 3D model with gestures and expression from a single image (Figure 1.11c). And while depth sensors have not yet become ubiquitous (except for security applications on high-end phones), computational photography algorithms run on all of today’s smart- phones. Innovations introduced in the computer vision community, such as panoramic image stitching and bracketed high dynamic range image merging, are now standard features, and multi-image low-light denoising algorithms are also becoming commonplace (Liba, Murthy et al. 2019). Lightfield imaging algorithms, which allow the creation of soft depth-of-field effects, are now also becoming more available (Garg, Wadhwa et al. 2019). Finally, mo- bile augmented reality applications that perform real-time pose estimation and environment 22 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) augmentation using combinations of feature tracking and inertial measurements are com- monplace, and are currently being extended to include pixel-accurate depth occlusion effects (Figure 1.11e). On higher-end platforms such as autonomous vehicles and drones, powerful real-time SLAM (simultaneous localization and mapping) and VIO (visual inertial odometry) algo- rithms (Engel, Schöps, and Cremers 2014; Forster, Zhang et al. 2017; Engel, Koltun, and Cremers 2018) can build accurate 3D maps that enable, e.g., autonomous flight through chal- lenging scenes such as forests (Figure 1.11f). In summary, this past decade has seen incredible advances in the performance and reli- ability of computer vision algorithms, brought in part by the shift to machine learning and training on very large sets of real-world data. It has also seen the application of vision algo- rithms in myriad commercial and consumer scenarios as well as new challenges engendered by their widespread use (Su and Crandall 2021). 
1.3 Book overview

In the final part of this introduction, I give a brief tour of the material in this book, as well as a few notes on notation and some additional general references. Since computer vision is such a broad field, it is possible to study certain aspects of it, e.g., geometric image formation and 3D structure recovery, without requiring other parts, e.g., the modeling of reflectance and shading. Some of the chapters in this book are only loosely coupled with others, and it is not strictly necessary to read all of the material in sequence.

Figure 1.12 shows a rough layout of the contents of this book. Since computer vision involves going from images to both a semantic understanding as well as a 3D structural description of the scene, I have positioned the chapters horizontally in terms of where in this spectrum they land, in addition to vertically according to their dependence.9

Interspersed throughout the book are sample applications, which relate the algorithms and mathematical material being presented in various chapters to useful, real-world applications. Many of these applications are also presented in the exercises sections, so that students can write their own.

At the end of each section, I provide a set of exercises that the students can use to implement, test, and refine the algorithms and techniques presented in each section. Some of the exercises are suitable as written homework assignments, others as shorter one-week projects,

9 For an interesting comparison with what is known about the human visual system, e.g., the largely parallel what and where pathways (Goodale and Milner 1992), see some textbooks on human perception (Palmer 1999; Livingstone 2008; Frisby and Stone 2010).

Figure 2.14 A simplified model of photometric image formation. Light is emitted by one or more light sources and is then reflected from an object's surface. A portion of this light is directed towards the camera. This simplified model ignores multiple reflections, which often occur in real-world scenes.

essentially converted the sensor back into a linear imager and the previous decomposition still applies.

2.2 Photometric image formation

In modeling the image formation process, we have described how 3D geometric features in the world are projected into 2D features in an image. However, images are not composed of 2D features. Instead, they are made up of discrete color or intensity values. Where do these values come from? How do they relate to the lighting in the environment, surface properties and geometry, camera optics, and sensor properties (Figure 2.14)? In this section, we develop a set of models to describe these interactions and formulate a generative process of image formation. A more detailed treatment of these topics can be found in textbooks on computer graphics and image synthesis (Cohen and Wallace 1993; Sillion and Puech 1994; Watt 1995; Glassner 1995; Weyrich, Lawrence et al. 2009; Hughes, van Dam et al. 2013; Marschner and Shirley 2015).

2.2.1 Lighting

Images cannot exist without light. To produce an image, the scene must be illuminated with one or more light sources. (Certain modalities such as fluorescence microscopy and X-ray tomography do not fit this model, but we do not deal with them in this book.) Light sources can generally be divided into point and area light sources.
A point light source originates at a single location in space (e.g., a small light bulb), potentially at infinity (e.g., the Sun). (Note that for some applications such as modeling soft shadows (penumbras), the Sun may have to be treated as an area light source.) In addition to its location, a point light source has an intensity and a color spectrum, i.e., a distribution over wavelengths L(λ). The intensity of a light source falls off with the square of the distance between the source and the object being lit, because the same light is being spread over a larger (spherical) area. A light source may also have a directional falloff (dependence), but we ignore this in our simplified model.

Area light sources are more complicated. A simple area light source such as a fluorescent ceiling light fixture with a diffuser can be modeled as a finite rectangular area emitting light equally in all directions (Cohen and Wallace 1993; Sillion and Puech 1994; Glassner 1995). When the distribution is strongly directional, a four-dimensional lightfield can be used instead (Ashdown 1993).

A more complex light distribution that approximates, say, the incident illumination on an object sitting in an outdoor courtyard can often be represented using an environment map (Greene 1986), originally called a reflection map (Blinn and Newell 1976). This representation maps incident light directions v̂ to color values (or wavelengths, λ),

L(\hat{v}; \lambda),    (2.81)

and is equivalent to assuming that all light sources are at infinity. Environment maps can be represented as a collection of cubical faces (Greene 1986), as a single longitude–latitude map (Blinn and Newell 1976), or as the image of a reflecting sphere (Watt 1995). A convenient way to get a rough model of a real-world environment map is to take an image of a reflective mirrored sphere (sometimes accompanied by a darker sphere to capture highlights) and to unwrap this image onto the desired environment map (Debevec 1998). Watt (1995) gives a nice discussion of environment mapping, including the formulas needed to map directions to pixels for the three most commonly used representations.

2.2.2 Reflectance and shading

When light hits an object's surface, it is scattered and reflected (Figure 2.15a). Many different models have been developed to describe this interaction. In this section, we first describe the most general form, the bidirectional reflectance distribution function, and then look at some more specialized models, including the diffuse, specular, and Phong shading models. We also discuss how these models can be used to compute the global illumination corresponding to a scene.

Figure 2.15 (a) Light scatters when it hits a surface. (b) The bidirectional reflectance distribution function (BRDF) f(θ_i, φ_i, θ_r, φ_r) is parameterized by the angles that the incident, v̂_i, and reflected, v̂_r, light ray directions make with the local surface coordinate frame (d̂_x, d̂_y, n̂).

The Bidirectional Reflectance Distribution Function (BRDF)

The most general model of light scattering is the bidirectional reflectance distribution function (BRDF).8 Relative to some local coordinate frame on the surface, the BRDF is a four-dimensional function that describes how much of each wavelength arriving at an incident direction v̂_i is emitted in a reflected direction v̂_r (Figure 2.15b).
The function can be written in terms of the angles of the incident and reflected directions relative to the surface frame as

f_r(\theta_i, \phi_i, \theta_r, \phi_r; \lambda).    (2.82)

The BRDF is reciprocal, i.e., because of the physics of light transport, you can interchange the roles of v̂_i and v̂_r and still get the same answer (this is sometimes called Helmholtz reciprocity).

Most surfaces are isotropic, i.e., there are no preferred directions on the surface as far as light transport is concerned. (The exceptions are anisotropic surfaces such as brushed (scratched) aluminum, where the reflectance depends on the light orientation relative to the direction of the scratches.) For an isotropic material, we can simplify the BRDF to

f_r(\theta_i, \theta_r, |\phi_r - \phi_i|; \lambda) \quad \text{or} \quad f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda),    (2.83)

as the quantities θ_i, θ_r, and φ_r − φ_i can be computed from the directions v̂_i, v̂_r, and n̂.

8 Actually, even more general models of light transport exist, including some that model spatial variation along the surface, sub-surface scattering, and atmospheric effects—see Section 13.7.1—(Dorsey, Rushmeier, and Sillion 2007; Weyrich, Lawrence et al. 2009).

Figure 2.16 This close-up of a statue shows both diffuse (smooth shading) and specular (shiny highlight) reflection, as well as darkening in the grooves and creases due to reduced light visibility and interreflections. (Photo courtesy of the Caltech Vision Lab, http://www.vision.caltech.edu/archive.html.)

To calculate the amount of light exiting a surface point p in a direction v̂_r under a given lighting condition, we integrate the product of the incoming light L_i(v̂_i; λ) with the BRDF (some authors call this step a convolution). Taking into account the foreshortening factor cos⁺θ_i, we obtain

L_r(\hat{v}_r; \lambda) = \int L_i(\hat{v}_i; \lambda)\, f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i \, d\hat{v}_i,    (2.84)

where

\cos^+ \theta_i = \max(0, \cos \theta_i).    (2.85)

If the light sources are discrete (a finite number of point light sources), we can replace the integral with a summation,

L_r(\hat{v}_r; \lambda) = \sum_i L_i(\lambda)\, f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i.    (2.86)

BRDFs for a given surface can be obtained through physical modeling (Torrance and Sparrow 1967; Cook and Torrance 1982; Glassner 1995), heuristic modeling (Phong 1975; Lafortune, Foo et al. 1997), or through empirical observation (Ward 1992; Westin, Arvo, and Torrance 1992; Dana, van Ginneken et al. 1999; Marschner, Westin et al. 2000; Matusik, Pfister et al. 2003; Dorsey, Rushmeier, and Sillion 2007; Weyrich, Lawrence et al. 2009; Shi, Mo et al. 2019).9 Typical BRDFs can often be split into their diffuse and specular components, as described below.

9 See http://www1.cs.columbia.edu/CAVE/software/curet for a database of some empirically sampled BRDFs.

The function

f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\ t/(3\delta^2) + 2\delta/3 & \text{else,} \end{cases}    (2.107)

is a finite-slope approximation to the cube root with δ = 6/29. The resulting 0...100 scale roughly measures equal amounts of lightness perceptibility.

25 Another perceptually motivated color space called L*u*v* was developed and standardized simultaneously (Fairchild 2013).

In a similar fashion, the a* and b* components are defined as

a^* = 500 \left[ f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right) \right] \quad \text{and} \quad b^* = 200 \left[ f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right) \right],    (2.108)

where again, (X_n, Y_n, Z_n) is the measured white point. Figure 2.33i–k show the L*a*b* representation for a sample color image.
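To make the L*a*b* mapping concrete, here is a small sketch in Python/NumPy of Equations (2.107)–(2.108). The function name xyz_to_lab, the default D65 white point values, and the L* line (which follows the standard CIE definition of lightness; the excerpt above does not show that equation) are illustrative assumptions rather than anything prescribed by the book.

```python
import numpy as np

def f(t, delta=6.0 / 29.0):
    """CIELAB nonlinearity of Eq. (2.107): a cube root with a finite-slope
    linear segment below delta**3."""
    t = np.asarray(t, dtype=float)
    return np.where(t > delta**3, np.cbrt(t), t / (3 * delta**2) + 2 * delta / 3)

def xyz_to_lab(X, Y, Z, white=(0.9505, 1.0, 1.089)):
    """Convert XYZ tristimulus values to L*a*b* relative to a white point
    (Xn, Yn, Zn); the default is an assumed D65 white."""
    Xn, Yn, Zn = white
    fx, fy, fz = f(X / Xn), f(Y / Yn), f(Z / Zn)
    L = 116 * fy - 16        # standard CIE lightness definition (not shown in this excerpt)
    a = 500 * (fx - fy)      # Eq. (2.108)
    b = 200 * (fy - fz)      # Eq. (2.108)
    return L, a, b

# Sanity check: the white point itself maps to L* = 100, a* = b* = 0.
print(xyz_to_lab(0.9505, 1.0, 1.089))
```

Running it on the white point returns (100, 0, 0), which is a quick check that the nonlinearity and the white-point normalization are wired up correctly.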
Color cameras

While the preceding discussion tells us how we can uniquely describe the perceived tri-stimulus description of any color (spectral distribution), it does not tell us how RGB still and video cameras actually work. Do they just measure the amount of light at the nominal wavelengths of red (700.0 nm), green (546.1 nm), and blue (435.8 nm)? Do color monitors just emit exactly these wavelengths and, if so, how can they emit negative red light to reproduce colors in the cyan range?

In fact, the design of RGB video cameras has historically been based around the availability of colored phosphors that go into television sets. When standard-definition color television was invented (NTSC), a mapping was defined between the RGB values that would drive the three color guns in the cathode ray tube (CRT) and the XYZ values that unambiguously define perceived color (this standard was called ITU-R BT.601). With the advent of HDTV and newer monitors, a new standard called ITU-R BT.709 was created, which specifies the XYZ values of each of the color primaries,

\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.412453 & 0.357580 & 0.180423 \\ 0.212671 & 0.715160 & 0.072169 \\ 0.019334 & 0.119193 & 0.950227 \end{bmatrix} \begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix}.    (2.109)

In practice, each color camera integrates light according to the spectral response function of its red, green, and blue sensors,

R = \int L(\lambda) S_R(\lambda) \, d\lambda, \quad G = \int L(\lambda) S_G(\lambda) \, d\lambda, \quad B = \int L(\lambda) S_B(\lambda) \, d\lambda,    (2.110)

where L(λ) is the incoming spectrum of light at a given pixel and {S_R(λ), S_G(λ), S_B(λ)} are the red, green, and blue spectral sensitivities of the corresponding sensors.

Can we tell what spectral sensitivities the cameras actually have? Unless the camera manufacturer provides us with these data or we observe the response of the camera to a whole spectrum of monochromatic lights, these sensitivities are not specified by a standard such as BT.709. Instead, all that matters is that the tri-stimulus values for a given color produce the specified RGB values. The manufacturer is free to use sensors with sensitivities that do not match the standard XYZ definitions, so long as they can later be converted (through a linear transform) to the standard colors.

Similarly, while TV and computer monitors are supposed to produce RGB values as specified by Equation (2.109), there is no reason that they cannot use digital logic to transform the incoming RGB values into different signals to drive each of the color channels.26 Properly calibrated monitors make this information available to software applications that perform color management, so that colors in real life, on the screen, and on the printer all match as closely as possible.

26 The latest OLED TV monitors are now introducing higher dynamic range (HDR) and wide color gamut (WCG), https://www.cnet.com/how-to/what-is-wide-color-gamut-wcg.

Color filter arrays

While early color TV cameras used three vidicons (tubes) to perform their sensing and later cameras used three separate RGB sensing chips, most of today's digital still and video cameras use a color filter array (CFA), where alternating sensors are covered by different colored filters (Figure 2.24).27 The most commonly used pattern in color cameras today is the Bayer pattern (Bayer 1976), which places green filters over half of the sensors (in a checkerboard pattern), and red and blue filters over the remaining ones (Figure 2.31).

27 A chip design by Foveon stacked the red, green, and blue sensors beneath each other, but it never gained widespread adoption. Descriptions of alternative color filter arrays that have been proposed over the years can be found at https://en.wikipedia.org/wiki/Color_filter_array.

Figure 2.31 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, with unknown (guessed) values shown as lower case.

The reason that there are twice as many green filters as red and blue is that the luminance signal is mostly determined by green values and the visual system is much more sensitive to high-frequency detail in luminance than in chrominance (a fact that is exploited in color image compression—see Section 2.3.3). The process of interpolating the missing color values so that we have valid RGB values for all the pixels is known as demosaicing and is covered in detail in Section 10.3.1.

Similarly, color LCD monitors typically use alternating stripes of red, green, and blue filters placed in front of each liquid crystal active area to simulate the experience of a full color display. As before, because the visual system has higher resolution (acuity) in luminance than chrominance, it is possible to digitally prefilter RGB (and monochrome) images to enhance the perception of crispness (Betrisey, Blinn et al. 2000; Platt 2000b).
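As a rough illustration of the Bayer sampling and interpolation process just described (the real demosaicing algorithms are covered in Section 10.3.1), the following sketch builds a mosaic using the G R / B G layout of Figure 2.31a and then fills in the missing values by averaging each channel's known neighbors. The helper names and the use of scipy.ndimage are my own choices; production demosaicers are considerably more sophisticated than this.

```python
import numpy as np
from scipy.ndimage import convolve

def bayer_mosaic(rgb):
    """Sample an H x W x 3 float image onto a single-channel Bayer mosaic
    with the G R / B G layout of Figure 2.31a."""
    H, W, _ = rgb.shape
    r = np.zeros((H, W), bool)
    g = np.zeros((H, W), bool)
    b = np.zeros((H, W), bool)
    g[0::2, 0::2] = True   # green on a checkerboard (half of all sites)
    g[1::2, 1::2] = True
    r[0::2, 1::2] = True   # red on the remaining even-row sites
    b[1::2, 0::2] = True   # blue on the remaining odd-row sites
    mosaic = np.zeros((H, W))
    for c, mask in enumerate((r, g, b)):
        mosaic[mask] = rgb[..., c][mask]
    return mosaic, (r, g, b)

def demosaic_bilinear(mosaic, masks):
    """Naive demosaicing: keep the sampled values and fill the missing ones
    with a normalized 3x3 average of each channel's known neighbors."""
    kernel = np.ones((3, 3))
    out = np.zeros(mosaic.shape + (3,))
    for c, mask in enumerate(masks):
        sums = convolve(mosaic * mask, kernel, mode='mirror')
        counts = convolve(mask.astype(float), kernel, mode='mirror')
        filled = sums / np.maximum(counts, 1e-8)
        out[..., c] = np.where(mask, mosaic, filled)
    return out
```

Even this crude interpolation makes the luminance/chrominance trade-off visible: the green channel, sampled twice as densely, is reconstructed with noticeably less blur than red and blue.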
The reason that there are twice as many green filters as red and blue is because the luminance signal is mostly determined by green values and the visual system is much more sensitive to high-frequency detail in luminance than in chrominance (a fact that is exploited in color image compression—see Section 2.3.3). The process of interpolating the missing color values so that we have valid RGB values for all the pixels is known as demosaicing and is covered in detail in Section 10.3.1. Similarly, color LCD monitors typically use alternating stripes of red, green, and blue filters placed in front of each liquid crystal active area to simulate the experience of a full color display. As before, because the visual system has higher resolution (acuity) in luminance than chrominance, it is possible to digitally prefilter RGB (and monochrome) images to enhance 26 The latest OLED TV monitors are now introducing higher dynamic range (HDR) and wide color gamut (WCG), https://www.cnet.com/how-to/what-is-wide-color-gamut-wcg. 27 A chip design by Foveon stacked the red, green, and blue sensors beneath each other, but it never gained widespread adoption. Descriptions of alternative color filter arrays that have been proposed over the years can be found at https://en.wikipedia.org/wiki/Color filter array. 94 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) G R G R rGb Rgb rGb Rgb B G B G rgB rGb rgB rGb G R G R rGb Rgb rGb Rgb B G B G rgB rGb rgB rGb (a) (b) Figure 2.31 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, with unknown (guessed) values shown as lower case. the perception of crispness (Betrisey, Blinn et al. 2000; Platt 2000b). Color balance Before encoding the sensed RGB values, most cameras perform some kind of color balancing operation in an attempt to move the white point of a given image closer to pure white (equal RGB values). If the color system and the illumination are the same (the BT.709 system uses the daylight illuminant D65 as its reference white), the change may be minimal. However, if the illuminant is strongly colored, such as incandescent indoor lighting (which generally results in a yellow or orange hue), the compensation can be quite significant. A simple way to perform color correction is to multiply each of the RGB values by a different factor (i.e., to apply a diagonal matrix transform to the RGB color space). More complicated transforms, which are sometimes the result of mapping to XYZ space and back, actually perform a color twist, i.e., they use a general 3 × 3 color transform matrix.28 Exer- cise 2.8 has you explore some of these issues. Gamma In the early days of black and white television, the phosphors in the CRT used to display the TV signal responded non-linearly to their input voltage. The relationship between the voltage and the resulting brightness was characterized by a number called gamma (γ), since the formula was roughly B = V γ, (2.111) 28 Those of you old enough to remember the early days of color television will naturally think of the hue adjustment knob on the television set, which could produce truly bizarre results. 2.3 The digital camera 95 Y’ Y visible Y’ = Y1/γ noise Y = Y’γ Y quantization Y’ noise Figure 2.32 Gamma compression: (a) The relationship between the input signal luminance Y and the transmitted signal Y 0 is given by Y 0 = Y 1/γ. (b) At the receiver, the signal Y 0 is exponentiated by the factor γ, Ŷ = Y 0γ. 
Noise introduced during transmission is squashed in the dark regions, which corresponds to the more noise-sensitive region of the visual system. with a γ of about 2.2. To compensate for this effect, the electronics in the TV camera would pre-map the sensed luminance Y through an inverse gamma, 1 Y0 = Y γ, (2.112) with a typical value of γ1 = 0.45. The mapping of the signal through this non-linearity before transmission had a beneficial side effect: noise added during transmission (remember, these were analog days!) would be reduced (after applying the gamma at the receiver) in the darker regions of the signal where it was more visible (Figure 2.32).29 (Remember that our visual system is roughly sensitive to relative differences in luminance.) When color television was invented, it was decided to separately pass the red, green, and blue signals through the same gamma non-linearity before combining them for encoding. Today, even though we no longer have analog noise in our transmission systems, signals are still quantized during compression (see Section 2.3.3), so applying inverse gamma to sensed values remains useful. Unfortunately, for both computer vision and computer graphics, the presence of gamma in images is often problematic. For example, the proper simulation of radiometric phenomena such as shading (see Section 2.2 and Equation (2.88)) occurs in a linear radiance space. Once all of the computations have been performed, the appropriate gamma should be applied before display. Unfortunately, many computer graphics systems (such as shading models) operate 29 A related technique called companding was the basis of the Dolby noise reduction systems used with audio tapes. 96 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) directly on RGB values and display these values directly. (Fortunately, newer color imaging standards such as the 16-bit scRGB use a linear space, which makes this less of a problem (Glassner 1995).) In computer vision, the situation can be even more daunting. The accurate determination of surface normals, using a technique such as photometric stereo (Section 13.1.1) or even a simpler operation such as accurate image deblurring, require that the measurements be in a linear space of intensities. Therefore, it is imperative when performing detailed quantitative computations such as these to first undo the gamma and the per-image color re-balancing in the sensed color values. Chakrabarti, Scharstein, and Zickler (2009) develop a sophisti- cated 24-parameter model that is a good match to the processing performed by today’s digital cameras; they also provide a database of color images you can use for your own testing. For other vision applications, however, such as feature detection or the matching of sig- nals in stereo and motion estimation, this linearization step is often not necessary. In fact, determining whether it is necessary to undo gamma can take some careful thinking, e.g., in the case of compensating for exposure variations in image stitching (see Exercise 2.7). If all of these processing steps sound confusing to model, they are. Exercise 2.9 has you try to tease apart some of these phenomena using empirical investigation, i.e., taking pictures of color charts and comparing the RAW and JPEG compressed color values. 
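Because so many of these quantitative computations assume linear intensities, undoing the gamma is often the first step in a vision pipeline. The following is a minimal sketch assuming a pure power law with γ ≈ 2.2; real transfer functions such as sRGB or BT.709 add a short linear segment near black, and cameras apply their own tone curves, so this is only a first-order approximation (the Chakrabarti, Scharstein, and Zickler (2009) model mentioned above is far more faithful).

import numpy as np

def undo_gamma(encoded, gamma=2.2):
    # Map gamma-compressed values Y' in [0, 1] back to (approximately)
    # linear luminance, Y = Y'**gamma, inverting (2.112).
    return np.clip(encoded, 0.0, 1.0) ** gamma

def reapply_gamma(linear, gamma=2.2):
    # Re-apply the encoding non-linearity Y' = Y**(1/gamma) of (2.112)
    # once linear-space computations (shading, deblurring, ...) are done.
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

For an eight-bit JPEG image, the values should first be scaled to [0, 1] (e.g., img / 255.0) before linearization.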
Other color spaces While RGB and XYZ are the primary color spaces used to describe the spectral content (and hence tri-stimulus response) of color signals, a variety of other representations have been developed both in video and still image coding and in computer graphics. The earliest color representation developed for video transmission was the YIQ standard developed for NTSC video in North America and the closely related YUV standard developed for PAL in Europe. In both of these cases, it was desired to have a luma channel Y (so called since it only roughly mimics true luminance) that would be comparable to the regular black- and-white TV signal, along with two lower frequency chroma channels. In both systems, the Y signal (or more appropriately, the Y’ luma signal since it is gamma compressed) is obtained from 0 Y601 = 0.299R0 + 0.587G0 + 0.114B 0 , (2.113) where R0 G0 B0 is the triplet of gamma-compressed color components. When using the newer color definitions for HDTV in BT.709, the formula is 0 Y709 = 0.2125R0 + 0.7154G0 + 0.0721B 0. (2.114) 2.3 The digital camera 97 The UV components are derived from scaled versions of (B 0 −Y 0 ) and (R0 −Y 0 ), namely, U = 0.492111(B 0 − Y 0 ) and V = 0.877283(R0 − Y 0 ), (2.115) whereas the IQ components are the UV components rotated through an angle of 33°. In composite (NTSC and PAL) video, the chroma signals were then low-pass filtered horizon- tally before being modulated and superimposed on top of the Y 0 luma signal. Backward compatibility was achieved by having older black-and-white TV sets effectively ignore the high-frequency chroma signal (because of slow electronics) or, at worst, superimposing it as a high-frequency pattern on top of the main signal. While these conversions were important in the early days of computer vision, when frame grabbers would directly digitize the composite TV signal, today all digital video and still image compression standards are based on the newer YCbCr conversion. YCbCr is closely related to YUV (the Cb and Cr signals carry the blue and red color difference signals and have more useful mnemonics than UV) but uses different scale factors to fit within the eight-bit range available with digital signals. For video, the Y 0 signal is re-scaled to fit within the [16... 235] range of values, while the Cb and Cr signals are scaled to fit within [16... 240] (Gomes and Velho 1997; Fairchild 2013). For still images, the JPEG standard uses the full eight-bit range with no reserved values,        Y0 0.299 0.587 0.114 R0 0      0   Cb  = −0.168736 −0.331264 0.5  G  + 128 , (2.116) 0 Cr 0.5 −0.418688 −0.081312 B 128 where the R0 G0 B0 values are the eight-bit gamma-compressed color components (i.e., the actual RGB values we obtain when we open up or display a JPEG image). For most appli- cations, this formula is not that important, since your image reading software will directly provide you with the eight-bit gamma-compressed R0 G0 B0 values. However, if you are trying to do careful image deblocking (Exercise 4.3), this information may be useful. Another color space you may come across is hue, saturation, value (HSV), which is a projection of the RGB color cube onto a non-linear chroma angle, a radial saturation per- centage, and a luminance-inspired value. 
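As a rough illustration, here is a minimal sketch of the most common variant of this conversion, the max-based value used in most implementations; the description that follows states the definitions more generally, and the exact formulas can be found in the references cited there.

import numpy as np

def rgb_to_hsv(rgb):
    # rgb: (..., 3) array of values in [0, 1]; returns H in degrees, S and V in [0, 1].
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = np.max(rgb, axis=-1)                             # value = brightest channel
    c = v - np.min(rgb, axis=-1)                         # chroma = distance from the gray diagonal
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)   # saturation
    safe_c = np.where(c > 0, c, 1.0)                     # avoid division by zero for grays
    h = np.where(v == r, ((g - b) / safe_c) % 6.0,
        np.where(v == g, (b - r) / safe_c + 2.0,
                         (r - g) / safe_c + 4.0))
    h = np.where(c > 0, 60.0 * h, 0.0)                   # hue = angle around the color wheel
    return np.stack([h, s, v], axis=-1)

Note that hue is undefined for grays (zero chroma) and is simply set to zero here.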
In more detail, value is defined as either the mean or maximum color value, saturation is defined as scaled distance from the diagonal, and hue is defined as the direction around a color wheel (the exact formulas are described by Hall (1989), Hughes, van Dam et al. (2013), and Brown (2019)). Such a decomposition is quite natural in graphics applications such as color picking (it approximates the Munsell chart for color description). Figure 2.33l–n shows an HSV representation of a sample color image, 98 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) where saturation is encoded using a gray scale (saturated = darker) and hue is depicted as a color. If you want your computer vision algorithm to only affect the value (luminance) of an image and not its saturation or hue, a simpler solution is to use either the Y xy (luminance + chromaticity) coordinates defined in (2.105) or the even simpler color ratios, R G B r= , g= , b= (2.117) R+G+B R+G+B R+G+B (Figure 2.33e–h). After manipulating the luma (2.113), e.g., through the process of histogram equalization (Section 3.1.4), you can multiply each color ratio by the ratio of the new to old luma to obtain an adjusted RGB triplet. While all of these color systems may sound confusing, in the end, it often may not mat- ter that much which one you use. Poynton, in his Color FAQ, https://www.poynton.com/ ColorFAQ.html, notes that the perceptually motivated L*a*b* system is qualitatively similar to the gamma-compressed R0 G0 B0 system we mostly deal with, since both have a fractional power scaling (which approximates a logarithmic response) between the actual intensity val- ues and the numbers being manipulated. As in all cases, think carefully about what you are trying to accomplish before deciding on a technique to use. 2.3.3 Compression The last stage in a camera’s processing pipeline is usually some form of image compression (unless you are using a lossless compression scheme such as camera RAW or PNG). All color video and image compression algorithms start by converting the signal into YCbCr (or some closely related variant), so that they can compress the luminance signal with higher fidelity than the chrominance signal. (Recall that the human visual system has poorer frequency response to color than to luminance changes.) In video, it is common to subsam- ple Cb and Cr by a factor of two horizontally; with still images (JPEG), the subsampling (averaging) occurs both horizontally and vertically. Once the luminance and chrominance images have been appropriately subsampled and separated into individual images, they are then passed to a block transform stage. The most common technique used here is the discrete cosine transform (DCT), which is a real-valued variant of the discrete Fourier transform (DFT) (see Section 3.4.1). The DCT is a reasonable approximation to the Karhunen–Loève or eigenvalue decomposition of natural image patches, i.e., the decomposition that simultaneously packs the most energy into the first coefficients and diagonalizes the joint covariance matrix among the pixels (makes transform coefficients statistically independent). Both MPEG and JPEG use 8 × 8 DCT transforms (Wallace 1991; 2.3 The digital camera 99 (a) RGB (b) R (c) G (d) B (e) rgb (f) r (g) g (h) b (i) L* (j) a* (k) b* (l) H (m) S (n) V Figure 2.33 Color space transformations: (a–d) RGB; (e–h) rgb. (i–k) L*a*b*; (l–n) HSV. Note that the rgb, L*a*b*, and HSV values are all re-scaled to fit the dynamic range of the printed page. 
100 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) Figure 2.34 Image compressed with JPEG at three quality settings. Note how the amount of block artifact and high-frequency aliasing (“mosquito noise”) increases from left to right. Le Gall 1991), although newer variants, including the new AV1 open standard,30 use smaller 4 × 4 or even 2 × 2 blocks. Alternative transformations, such as wavelets (Taubman and Marcellin 2002) and lapped transforms (Malvar 1990, 1998, 2000) are used in compression standards such as JPEG 2000 and JPEG XR. After transform coding, the coefficient values are quantized into a set of small integer values that can be coded using a variable bit length scheme such as a Huffman code or an arithmetic code (Wallace 1991; Marpe, Schwarz, and Wiegand 2003). (The DC (lowest fre- quency) coefficients are also adaptively predicted from the previous block’s DC values. The term “DC” comes from “direct current”, i.e., the non-sinusoidal or non-alternating part of a signal.) The step size in the quantization is the main variable controlled by the quality setting on the JPEG file (Figure 2.34). With video, it is also usual to perform block-based motion compensation, i.e., to encode the difference between each block and a predicted set of pixel values obtained from a shifted block in the previous frame. (The exception is the motion-JPEG scheme used in older DV camcorders, which is nothing more than a series of individually JPEG compressed image frames.) While basic MPEG uses 16 × 16 motion compensation blocks with integer motion values (Le Gall 1991), newer standards use adaptively sized blocks, sub-pixel motions, and the ability to reference blocks from older frames (Sullivan, Ohm et al. 2012). In order to recover more gracefully from failures and to allow for random access to the video stream, predicted P frames are interleaved among independently coded I frames. (Bi-directional B frames are also sometimes used.) The quality of a compression algorithm is usually reported using its peak signal-to-noise ratio (PSNR), which is derived from the average mean square error, 1 Xh ˆ i2 M SE = I(x) − I(x) , (2.118) n x 30 https://aomedia.org 2.4 Additional reading 101 ˆ where I(x) is the original uncompressed image and I(x) is its compressed counterpart, or equivalently, the root mean square error (RMS error), which is defined as √ RM S = M SE. (2.119) The PSNR is defined as 2 Imax Imax P SN R = 10 log10 = 20 log10 , (2.120) M SE RM S where Imax is the maximum signal extent, e.g., 255 for eight-bit images. While this is just a high-level sketch of how image compression works, it is useful to understand so that the artifacts introduced by such techniques can be compensated for in various computer vision applications. Note also that researchers are currently developing novel image and video compression algorithms based on deep neural networks, e.g., (Rippel and Bourdev 2017; Mentzer, Agustsson et al. 2019; Rippel, Nair et al. 2019) and https://www. compression.cc. It will be interesting to see what kinds of different artifacts these techniques produce. 2.4 Additional reading As we mentioned at the beginning of this chapter, this book provides but a brief summary of a very rich and deep set of topics, traditionally covered in a number of separate fields. 
A more thorough introduction to the geometry of points, lines, planes, and projections can be found in textbooks on multi-view geometry (Faugeras and Luong 2001; Hartley and Zisserman 2004) and computer graphics (Watt 1995; OpenGL-ARB 1997; Hughes, van Dam et al. 2013; Marschner and Shirley 2015). Topics covered in more depth include higher- order primitives such as quadrics, conics, and cubics, as well as three-view and multi-view geometry. The image formation (synthesis) process is traditionally taught as part of a computer graphics curriculum (Glassner 1995; Watt 1995; Hughes, van Dam et al. 2013; Marschner and Shirley 2015) but it is also studied in physics-based computer vision (Wolff, Shafer, and Healey 1992a). The behavior of camera lens systems is studied in optics (Möller 1988; Ray 2002; Hecht 2015). Some good books on color theory have been written by Healey and Shafer (1992), Wan- dell (1995), Wyszecki and Stiles (2000), and Fairchild (2013), with Livingstone (2008) pro- viding a more fun and informal introduction to the topic of color perception. Mark Fairchild’s page of color books and links31 lists many other sources. 31 http://markfairchild.org/WhyIsColor/books links.html. 108 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) coarse l=2 medium l=1 fine l=0 (c) (d) (e) (f) Figure 3.1 Some common image processing operations: (a) partial histogram equaliza- tion; (b) orientation map computed from the second-order steerable filter (Freeman 1992) © 1992 IEEE; (c) bilateral filter (Durand and Dorsey 2002) © 2002 ACM; (d) image pyramid; (e) Laplacian pyramid blending (Burt and Adelson 1983b) © 1983 ACM; (f) line-based image warping (Beier and Neely 1992) © 1992 ACM. 3.1 Point operators 109 Now that we have seen how images are formed through the interaction of 3D scene elements, lighting, and camera optics and sensors, let us look at the first stage in most computer vision algorithms, namely the use of image processing to preprocess the image and convert it into a form suitable for further analysis. Examples of such operations include exposure correction and color balancing, reducing image noise, increasing sharpness, or straightening the image by rotating it. Additional examples include image warping and image blending, which are often used for visual effects (Figures 3.1 and Section 3.6.3). While some may consider image processing to be outside the purview of computer vision, most computer vision applications, such as computational photography and even recognition, require care in designing the image processing stages to achieve acceptable results. In this chapter, we review standard image processing operators that map pixel values from one image to another. Image processing is often taught in electrical engineering departments as a follow-on course to an introductory course in signal processing (Oppenheim and Schafer 1996; Oppenheim, Schafer, and Buck 1999). There are several popular textbooks for image processing, including Gomes and Velho (1997), Jähne (1997), Pratt (2007), Burger and Burge (2009), and Gonzalez and Woods (2017). We begin this chapter with the simplest kind of image transforms, namely those that manipulate each pixel independently of its neighbors (Section 3.1). Such transforms are of- ten called point operators or point processes. Next, we examine neighborhood (area-based) operators, where each new pixel’s value depends on a small number of neighboring input values (Sections 3.2 and 3.3). 
A convenient tool to analyze (and sometimes accelerate) such neighborhood operations is the Fourier transform, which we cover in Section 3.4. Neighborhood operators can be cascaded to form image pyramids and wavelets, which are useful for analyzing images at a variety of resolutions (scales) and for accelerating certain operations (Section 3.5). Another important class of global operators are geometric transformations, such as rotations, shears, and perspective deformations (Section 3.6).

While this chapter covers classical image processing techniques that consist mostly of linear and non-linear filtering operations, the next two chapters introduce energy-based and Bayesian graphical models, i.e., Markov random fields (Chapter 4), and then deep convolutional networks (Chapter 5), both of which are now widely used in image processing applications.

3.1 Point operators

The simplest kinds of image processing transforms are point operators, where each output pixel's value depends on only the corresponding input pixel value (plus, potentially, some globally collected information or parameters). Examples of such operators include brightness and contrast adjustments (Figure 3.2) as well as color correction and transformations. In the image processing literature, such operations are also known as point processes (Crane 1997).1

1 In convolutional neural networks (Section 5.4), such operations are sometimes called 1 × 1 convolutions.

Figure 3.2 Some local image processing operations: (a) original image along with its three color (per-channel) histograms; (b) brightness increased (additive offset, b = 16); (c) contrast increased (multiplicative gain, a = 1.1); (d) gamma (partially) linearized (γ = 1.2); (e) full histogram equalization; (f) partial histogram equalization.

Figure 3.3 Visualizing image data: (a) original image; (b) cropped portion and scanline plot using an image inspection tool; (c) grid of numbers; (d) surface plot. For figures (c)–(d), the image was first converted to grayscale.

We begin this section with a quick review of simple point operators, such as brightness scaling and image addition. Next, we discuss how colors in images can be manipulated. We then present image compositing and matting operations, which play an important role in computational photography (Chapter 10) and computer graphics applications. Finally, we describe the more global process of histogram equalization. We close with an example application that manipulates tonal values (exposure and contrast) to improve image appearance.

3.1.1 Pixel transforms

A general image processing operator is a function that takes one or more input images and produces an output image. In the continuous domain, this can be denoted as

g(x) = h(f(x))   or   g(x) = h(f0(x), ..., fn(x)),   (3.1)

where x is in the D-dimensional (usually D = 2 for images) domain of the input and output functions f and g, which operate over some range, which can either be scalar or vector-valued, e.g., for color images or 2D motion.
For discrete (sampled) images, the domain consists of a finite number of pixel locations, x = (i, j), and we can write

g(i, j) = h(f(i, j)).   (3.2)

Figure 3.3 shows how an image can be represented either by its color (appearance), as a grid of numbers, or as a two-dimensional function (surface plot).

Two commonly used point processes are multiplication and addition with a constant,

g(x) = a f(x) + b.   (3.3)

The parameters a > 0 and b are often called the gain and bias parameters; sometimes these parameters are said to control contrast and brightness, respectively (Figures 3.2b–c).2 The bias and gain parameters can also be spatially varying,

g(x) = a(x) f(x) + b(x),   (3.4)

e.g., when simulating the graded density filter used by photographers to selectively darken the sky or when modeling vignetting in an optical system.

Multiplicative gain (both global and spatially varying) is a linear operation, as it obeys the superposition principle,

h(f0 + f1) = h(f0) + h(f1).   (3.5)

(We will have more to say about linear shift invariant operators in Section 3.2.) Operators such as image squaring (which is often used to get a local estimate of the energy in a band-pass filtered signal, see Section 3.5) are not linear. Another commonly used dyadic (two-input) operator is the linear blend operator,

g(x) = (1 − α) f0(x) + α f1(x),   (3.6)

which, as α is varied from 0 to 1, performs a temporal cross-dissolve between two images or videos.
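Both the gain/bias transform and the linear blend are one-liners in array code. The following is a minimal NumPy sketch, assuming floating-point images scaled to [0, 1] (integer images should be converted first and the results clipped back to their valid range); the function names are illustrative only.

import numpy as np

def gain_bias(f, a=1.0, b=0.0):
    # Point operator g(x) = a*f(x) + b, as in (3.3).  a and b may also be
    # arrays of the same shape as f, giving the spatially varying version
    # (3.4), e.g., a radial gain map modeling vignetting.
    return a * f + b

def linear_blend(f0, f1, alpha):
    # Dyadic linear blend g(x) = (1 - alpha)*f0(x) + alpha*f1(x), (3.6);
    # sweeping alpha from 0 to 1 cross-dissolves from f0 to f1.
    return (1.0 - alpha) * f0 + alpha * f1

For example, gain_bias(f, b=16/255) and gain_bias(f, a=1.1) correspond, on a [0, 1] image, to the brightness and contrast adjustments of Figures 3.2b and 3.2c.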
