Computer Vision: Algorithms and Applications, 2nd Edition
Richard Szeliski
Final draft, September 30, 2021
© 2022 Springer

This electronic draft was downloaded Dec 27, 2022 for the personal use of ??? [email protected] and may not be posted or re-distributed in any form. Please refer interested readers to the book's Web site at https://szeliski.org/Book, where you can also provide feedback.

This book is dedicated to my parents, Zdzisław and Jadwiga, and my family, Lyn, Anne, and Stephen.

Contents overview

1 Introduction (p. 1): What is computer vision? · A brief history · Book overview · Sample syllabus · Notation
2 Image formation (p. 33): Geometric primitives and transformations · Photometric image formation · The digital camera
3 Image processing (p. 107): Point operators · Linear filtering · Non-linear filtering · Fourier transforms · Pyramids and wavelets · Geometric transformations
4 Model fitting and optimization (p. 191): Scattered data interpolation · Variational methods and regularization · Markov random fields
5 Deep learning (p. 235): Supervised learning · Unsupervised learning · Deep neural networks · Convolutional networks · More complex models
6 Recognition (p. 343): Instance recognition · Image classification · Object detection · Semantic segmentation · Video understanding · Vision and language
7 Feature detection and matching (p. 417): Points and patches · Edges and contours · Contour tracking · Lines and vanishing points · Segmentation
8 Image alignment and stitching (p. 501): Pairwise alignment · Image stitching · Global alignment · Compositing
9 Motion estimation (p. 555): Translational alignment · Parametric motion · Optical flow · Layered motion
10 Computational photography (p. 607): Photometric calibration · High dynamic range imaging · Super-resolution, denoising, and blur removal · Image matting and compositing · Texture analysis and synthesis
11 Structure from motion and SLAM (p. 681): Geometric intrinsic calibration · Pose estimation · Two-frame structure from motion · Multi-frame structure from motion · Simultaneous localization and mapping (SLAM)
12 Depth estimation (p. 749): Epipolar geometry · Sparse correspondence · Dense correspondence · Local methods · Global optimization · Deep neural networks · Multi-view stereo · Monocular depth estimation
13 3D reconstruction (p. 805): Shape from X · 3D scanning · Surface representations · Point-based representations · Volumetric representations · Model-based reconstruction · Recovering texture maps and albedos
14 Image-based rendering (p. 861): View interpolation · Layered depth images · Light fields and Lumigraphs · Environment mattes · Video-based rendering · Neural rendering

Preface

The seeds for this book were first planted in 2001 when Steve Seitz at the University of Washington invited me to co-teach a course called "Computer Vision for Computer Graphics". At that time, computer vision techniques were increasingly being used in computer graphics to create image-based models of real-world objects, to create visual effects, and to merge real-world imagery using computational photography techniques. Our decision to focus on the applications of computer vision to fun problems such as image stitching and photo-based 3D modeling from personal photos seemed to resonate well with our students.

That initial course evolved into a more complete computer vision syllabus and project-oriented course structure that I used to co-teach general computer vision courses both at the University of Washington and at Stanford. (The latter was a course I co-taught with David Fleet in 2003.) Similar curricula were then adopted at a number of other universities and also incorporated into more specialized courses on computational photography.
(For ideas on how to use this book in your own course, please see Table 1.1 in Section 1.4.)

This book also reflects my 40 years' experience doing computer vision research in corporate research labs, mostly at Digital Equipment Corporation's Cambridge Research Lab, Microsoft Research, and Facebook. In pursuing my work, I have mostly focused on problems and solution techniques (algorithms) that have practical real-world applications and that work well in practice. Thus, this book has more emphasis on basic techniques that work under real-world conditions and less on more esoteric mathematics that has intrinsic elegance but less practical applicability.

This book is suitable for teaching a senior-level undergraduate course in computer vision to students in both computer science and electrical engineering. I prefer students to have either an image processing or a computer graphics course as a prerequisite, so that they can spend less time learning general background mathematics and more time studying computer vision techniques. The book is also suitable for teaching graduate-level courses in computer vision, e.g., by delving into more specialized topics, and as a general reference to fundamental techniques and the recent research literature. To this end, I have attempted wherever possible to at least cite the newest research in each sub-field, even if the technical details are too complex to cover in the book itself.

In teaching our courses, we have found it useful for the students to attempt a number of small implementation projects, which often build on one another, in order to get them used to working with real-world images and the challenges that these present. The students are then asked to choose an individual topic for each of their small-group, final projects. (Sometimes these projects even turn into conference papers!) The exercises at the end of each chapter contain numerous suggestions for smaller mid-term projects, as well as more open-ended problems whose solutions are still active research topics. Wherever possible, I encourage students to try their algorithms on their own personal photographs, since this better motivates them, often leads to creative variants on the problems, and better acquaints them with the variety and complexity of real-world imagery.

In formulating and solving computer vision problems, I have often found it useful to draw inspiration from four high-level approaches:

Scientific: build detailed models of the image formation process and develop mathematical techniques to invert these in order to recover the quantities of interest (where necessary, making simplifying assumptions to make the mathematics more tractable).

Statistical: use probabilistic models to quantify the prior likelihood of your unknowns and the noisy measurement processes that produce the input images, then infer the best possible estimates of your desired quantities and analyze their resulting uncertainties. The inference algorithms used are often closely related to the optimization techniques used to invert the (scientific) image formation processes.

Engineering: develop techniques that are simple to describe and implement but that are also known to work well in practice. Test these techniques to understand their limitations and failure modes, as well as their expected computational costs (run-time performance).
Data-driven: collect a representative set of test data (ideally, with labels or ground-truth answers) and use these data to either tune or learn your model parameters, or at least to validate and quantify its performance.

These four approaches build on each other and are used throughout the book.

My personal research and development philosophy (and hence the exercises in the book) have a strong emphasis on testing algorithms. It's too easy in computer vision to develop an algorithm that does something plausible on a few images rather than something correct. The best way to validate your algorithms is to use a three-part strategy. First, test your algorithm on clean synthetic data, for which the exact results are known. Second, add noise to the data and evaluate how the performance degrades as a function of noise level. Finally, test the algorithm on real-world data, preferably drawn from a wide variety of sources, such as photos found on the web. Only then can you truly know if your algorithm can deal with real-world complexity, i.e., images that do not fit some simplified model or assumptions.

In order to help students in this process, Appendix C includes pointers to commonly used datasets and software libraries that contain implementations of a wide variety of computer vision algorithms, which can enable you to tackle more ambitious projects (with your instructor's consent).

Notes on the Second Edition

The last decade has seen a truly dramatic explosion in the performance and applicability of computer vision algorithms, much of it engendered by the application of machine learning algorithms to large amounts of visual training data (Su and Crandall 2021). Deep neural networks now play an essential role in so many vision algorithms that the new edition of this book introduces them early on as a fundamental technique that gets used extensively in subsequent chapters. The most notable changes in the second edition include:

Machine learning, deep learning, and deep neural networks are introduced early on in Chapter 5, as they play just as fundamental a role in vision algorithms as more classical techniques, such as image processing, graphical/probabilistic models, and energy minimization, which are introduced in the preceding two chapters.

The recognition chapter has been moved earlier in the book to Chapter 6, since end-to-end deep learning systems no longer require the development of building blocks such as feature detection, matching, and segmentation. Many of the students taking vision classes are primarily interested in visual recognition, so presenting this material earlier in the course makes it easier for students to base their final project on these topics. This chapter also includes sections on semantic segmentation, video understanding, and vision and language.

The application of neural networks and deep learning to myriad computer vision algorithms and applications, including flow and stereo, 3D shape modeling, and newly emerging fields such as neural rendering.

New technologies such as SLAM (simultaneous localization and mapping) and VIO (visual inertial odometry) that now run reliably and are used in real-time applications such as augmented reality and autonomous navigation.

In addition to these larger changes, the book has been updated to reflect the latest state-of-the-art techniques such as internet-scale image search and phone-based computational photography.
The new edition includes over 1500 new citations (papers) and has over 200 new figures. Acknowledgements I would like to gratefully acknowledge all of the people whose passion for research and inquiry as well as encouragement have helped me write this book. Steve Zucker at McGill University first introduced me to computer vision, taught all of his students to question and debate research results and techniques, and encouraged me to pursue a graduate career in this area. Takeo Kanade and Geoff Hinton, my PhD thesis advisors at Carnegie Mellon Univer- sity, taught me the fundamentals of good research, writing, and presentation and mentored several generations of outstanding students and researchers. They fired up my interest in vi- sual processing, 3D modeling, and statistical methods, while Larry Matthies introduced me to Kalman filtering and stereo matching. Geoff continues to inspire so many of us with this undiminished passion for trying to figure out “what makes the brain work”. It’s been a delight to see his pursuit of connectionist ideas bear so much fruit in this past decade. Demetri Terzopoulos was my mentor at my first industrial research job and taught me the ropes of successful publishing. Yvan Leclerc and Pascal Fua, colleagues from my brief in- terlude at SRI International, gave me new perspectives on alternative approaches to computer vision. During my six years of research at Digital Equipment Corporation’s Cambridge Research Lab, I was fortunate to work with a great set of colleagues, including Ingrid Carlbom, Gudrun Klinker, Keith Waters, William Hsu, Richard Weiss, Stéphane Lavallée, and Sing Bing Kang, as well as to supervise the first of a long string of outstanding summer interns, including David Tonnesen, Sing Bing Kang, James Coughlan, and Harry Shum. This is also where I began my long-term collaboration with Daniel Scharstein. At Microsoft Research, I had the outstanding fortune to work with some of the world’s best researchers in computer vision and computer graphics, including Michael Cohen, Matt Preface xi Uyttendaele, Sing Bing Kang, Harry Shum, Larry Zitnick, Sudipta Sinha, Drew Steedly, Si- mon Baker, Johannes Kopf, Neel Joshi, Krishnan Ramnath, Anandan, Phil Torr, Antonio Cri- minisi, Simon Winder, Matthew Brown, Michael Goesele, Richard Hartley, Hugues Hoppe, Stephen Gortler, Steve Shafer, Matthew Turk, Georg Petschnigg, Kentaro Toyama, Ramin Zabih, Shai Avidan, Patrice Simard, Chris Pal, Nebojsa Jojic, Patrick Baudisch, Dani Lischin- ski, Raanan Fattal, Eric Stollnitz, David Nistér, Blaise Aguera y Arcas, Andrew Fitzgibbon, Jamie Shotton, Wolf Kienzle, Piotr Dollar, and Ross Girshick. I was also lucky to have as in- terns such great students as Polina Golland, Simon Baker, Mei Han, Arno Schödl, Ron Dror, Ashley Eden, Jonathan Shade, Jinxiang Chai, Rahul Swaminathan, Yanghai Tsin, Sam Hasi- noff, Anat Levin, Matthew Brown, Eric Bennett, Vaibhav Vaish, Jan-Michael Frahm, James Diebel, Ce Liu, Josef Sivic, Grant Schindler, Colin Zheng, Neel Joshi, Sudipta Sinha, Zeev Farbman, Rahul Garg, Tim Cho, Yekeun Jeong, Richard Roberts, Varsha Hedau, Dilip Kr- ishnan, Adarsh Kowdle, Edward Hsiao, Yong Seok Heo, Fabian Langguth, Andrew Owens, and Tianfan Xue. Working with such outstanding students also gave me the opportunity to collaborate with some of their amazing advisors, including Bill Freeman, Irfan Essa, Marc Pollefeys, Michael Black, Marc Levoy, and Andrew Zisserman. 
Since moving to Facebook, I’ve had the pleasure to continue my collaborations with Michael Cohen, Matt Uyttendaele, Johannes Kopf, Wolf Kienzle, and Krishnan Ramnath, and also new colleagues including Kevin Matzen, Bryce Evans, Suhib Alsisan, Changil Kim, David Geraghty, Jan Herling, Nils Plath, Jan-Michael Frahm, True Price, Richard Newcombe, Thomas Whelan, Michael Goesele, Steven Lovegrove, Julian Straub, Simon Green, Brian Cabral, Michael Toksvig, Albert Para Pozzo, Laura Sevilla-Lara, Georgia Gkioxari, Justin Johnson, Chris Sweeney, and Vassileios Balntas. I’ve also had the pleasure to collaborate with some outstanding summer interns, including Tianfan Xue, Scott Wehrwein, Peter Hed- man, Joel Janai, Aleksander Hołyński, Xuan Luo, Rui Wang, Olivia Wiles, and Yulun Tian. I’d like to thank in particular Michael Cohen, my mentor, colleague, and friend for the last 25 years for his unwavering support of my sprint to complete this second edition. While working at Microsoft and Facebook, I’ve also had the opportunity to collaborate with wonderful colleagues at the University of Washington, where I hold an Affiliate Profes- sor appointment. I’m indebted to Tony DeRose and David Salesin, who first encouraged me to get involved with the research going on at UW, my long-time collaborators Brian Curless, Steve Seitz, Maneesh Agrawala, Sameer Agarwal, and Yasu Furukawa, as well as the students I have had the privilege to supervise and interact with, including Fréderic Pighin, Yung-Yu Chuang, Doug Zongker, Colin Zheng, Aseem Agarwala, Dan Goldman, Noah Snavely, Ian Simon, Rahul Garg, Ryan Kaminsky, Juliet Fiss, Aleksander Hołyński, and Yifan Wang. As I mentioned at the beginning of this preface, this book owes its inception to the vision course xii Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) that Steve Seitz invited me to co-teach, as well as to Steve’s encouragement, course notes, and editorial input. I’m also grateful to the many other computer vision researchers who have given me so many constructive suggestions about the book, including Sing Bing Kang, who was my in- formal book editor, Vladimir Kolmogorov, Daniel Scharstein, Richard Hartley, Simon Baker, Noah Snavely, Bill Freeman, Svetlana Lazebnik, Matthew Turk, Jitendra Malik, Alyosha Efros, Michael Black, Brian Curless, Sameer Agarwal, Li Zhang, Deva Ramanan, Olga Veksler, Yuri Boykov, Carsten Rother, Phil Torr, Bill Triggs, Bruce Maxwell, Rico Mal- var, Jana Košecká, Eero Simoncelli, Aaron Hertzmann, Antonio Torralba, Tomaso Poggio, Theo Pavlidis, Baba Vemuri, Nando de Freitas, Chuck Dyer, Song Yi, Falk Schubert, Roman Pflugfelder, Marshall Tappen, James Coughlan, Sammy Rogmans, Klaus Strobel, Shanmu- ganathan, Andreas Siebert, Yongjun Wu, Fred Pighin, Juan Cockburn, Ronald Mallet, Tim Soper, Georgios Evangelidis, Dwight Fowler, Itzik Bayaz, Daniel O’Connor, Srikrishna Bhat, and Toru Tamaki, who wrote the Japanese translation and provided many useful errata. For the second edition, I received significant help and advice from three key contributors. Daniel Scharstein helped me update the chapter on stereo, Matt Deitke contributed descrip- tions of the newest papers in deep learning, including the sections on transformers, variational autoencoders, and text-to-image synthesis, along with the exercises in Chapters 5 and 6 and some illustrations. Sing Bing Kang reviewed multiple drafts and provided useful sugges- tions. 
I’d also like to thank Andrew Glassner, whose book (Glassner 2018) and figures were a tremendous help, Justin Johnson, Sean Bell, Ishan Misra, David Fouhey, Michael Brown, Abdelrahman Abdelhamed, Frank Dellaert, Xinlei Chen, Ross Girshick, Andreas Geiger, Dmytro Mishkin, Aleksander Hołyński, Joel Janai, Christoph Feichtenhofer, Yuandong Tian, Alyosha Efros, Pascal Fua, Torsten Sattler, Laura Leal-Taixé, Aljosa Osep, Qunjie Zhou, Jiřı́ Matas, Eddy Ilg, Yann LeCun, Larry Jackel, Vasileios Balntas, Daniel DeTone, Zachary Teed, Junhwa Hur, Jun-Yan Zhu, Filip Radenović, Michael Zollhöfer, Matthias Nießner, An- drew Owens, Hervé Jégou, Luowei Zhou, Ricardo Martin Brualla, Pratul Srinivasan, Matteo Poggi, Fabio Tosi, Ahmed Osman, Dave Howell, Holger Heidrich, Howard Yen, Anton Papst, Syamprasad K. Rajagopalan, Abhishek Nagar, Vladimir Kuznetsov, Raphaël Fouque, Marian Ciobanu, Darko Simonovic, and Guilherme Schlinker. In preparing the second edition, I taught some of the new material in two courses that I helped co-teach in 2020 at Facebook and UW. I’d like to thank my co-instructors Jan- Michael Frahm, Michael Goesele, Georgia Gkioxari, Ross Girshick, Jakob Julian Engel, Daniel Scharstein, Fernando de la Torre, Steve Seitz, and Harpreet Sawhney, from whom I learned a lot about the latest techniques that are included in the new edition. I’d also like to thank the TAs, including David Geraghty, True Price, Kevin Matzen, Akash Bapat, Alek- Preface xiii sander Hołyński, Keunhong Park, and Svetoslav Kolev, for the wonderful job they did in cre- ating and grading the assignments. I’d like to give a special thanks to Justin Johnson, whose excellent class slides (Johnson 2020), based on earlier slides from Stanford (Li, Johnson, and Yeung 2019), taught me the fundamentals of deep learning and which I used extensively in my own class and in preparing the new chapter on deep learning. Shena Deuchers and Ian Kingston did a fantastic job copy-editing the first and second editions, respectively and suggesting many useful improvements, and Wayne Wheeler and Simon Rees at Springer were most helpful throughout the whole book publishing process. Keith Price’s Annotated Computer Vision Bibliography was invaluable in tracking down ref- erences and related work. If you have any suggestions for improving the book, please send me an e-mail, as I would like to keep the book as accurate, informative, and timely as possible. The last year of writing this second edition took place during the worldwide COVID-19 pandemic. I would like to thank all of the first responders, medical and front-line workers, and everyone else who helped get us through these difficult and challenging times and to acknowledge the impact that this and other recent tragedies have had on all of us. Lastly, this book would not have been possible or worthwhile without the incredible sup- port and encouragement of my family. I dedicate this book to my parents, Zdzisław and Jadwiga, whose love, generosity, and accomplishments always inspired me; to my sister Ba- sia for her lifelong friendship; and especially to Lyn, Anne, and Stephen, whose love and support in all matters (including my book projects) makes it all worthwhile. Lake Wenatchee May 2021 xiv Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) Contents Preface vii Contents xv 1 Introduction 1 1.1 What is computer vision?............................ 3 1.2 A brief history.................................. 10 1.3 Book overview................................. 
22 1.4 Sample syllabus................................. 30 1.5 A note on notation............................... 31 1.6 Additional reading............................... 31 2 Image formation 33 2.1 Geometric primitives and transformations................... 36 2.1.1 2D transformations........................... 40 2.1.2 3D transformations........................... 43 2.1.3 3D rotations............................... 45 2.1.4 3D to 2D projections.......................... 51 2.1.5 Lens distortions............................. 63 2.2 Photometric image formation.......................... 66 2.2.1 Lighting................................. 66 2.2.2 Reflectance and shading........................ 67 2.2.3 Optics.................................. 74 2.3 The digital camera............................... 79 2.3.1 Sampling and aliasing......................... 84 2.3.2 Color.................................. 87 2.3.3 Compression.............................. 98 xvi Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 2.4 Additional reading............................... 101 2.5 Exercises.................................... 102 3 Image processing 107 3.1 Point operators................................. 109 3.1.1 Pixel transforms............................ 111 3.1.2 Color transforms............................ 112 3.1.3 Compositing and matting........................ 113 3.1.4 Histogram equalization......................... 115 3.1.5 Application: Tonal adjustment..................... 119 3.2 Linear filtering................................. 119 3.2.1 Separable filtering........................... 124 3.2.2 Examples of linear filtering....................... 125 3.2.3 Band-pass and steerable filters..................... 127 3.3 More neighborhood operators.......................... 131 3.3.1 Non-linear filtering........................... 132 3.3.2 Bilateral filtering............................ 133 3.3.3 Binary image processing........................ 138 3.4 Fourier transforms............................... 142 3.4.1 Two-dimensional Fourier transforms.................. 146 3.4.2 Application: Sharpening, blur, and noise removal........... 148 3.5 Pyramids and wavelets............................. 149 3.5.1 Interpolation.............................. 150 3.5.2 Decimation............................... 153 3.5.3 Multi-resolution representations.................... 154 3.5.4 Wavelets................................ 159 3.5.5 Application: Image blending...................... 165 3.6 Geometric transformations........................... 168 3.6.1 Parametric transformations....................... 168 3.6.2 Mesh-based warping.......................... 175 3.6.3 Application: Feature-based morphing................. 177 3.7 Additional reading............................... 178 3.8 Exercises.................................... 180 4 Model fitting and optimization 191 4.1 Scattered data interpolation........................... 194 4.1.1 Radial basis functions......................... 196 Contents xvii 4.1.2 Overfitting and underfitting....................... 199 4.1.3 Robust data fitting........................... 202 4.2 Variational methods and regularization..................... 204 4.2.1 Discrete energy minimization..................... 206 4.2.2 Total variation............................. 210 4.2.3 Bilateral solver............................. 210 4.2.4 Application: Interactive colorization.................. 211 4.3 Markov random fields.............................. 
212 4.3.1 Conditional random fields....................... 222 4.3.2 Application: Interactive segmentation................. 227 4.4 Additional reading............................... 230 4.5 Exercises.................................... 232 5 Deep Learning 235 5.1 Supervised learning............................... 239 5.1.1 Nearest neighbors............................ 241 5.1.2 Bayesian classification......................... 243 5.1.3 Logistic regression........................... 248 5.1.4 Support vector machines........................ 250 5.1.5 Decision trees and forests....................... 254 5.2 Unsupervised learning............................. 257 5.2.1 Clustering................................ 257 5.2.2 K-means and Gaussians mixture models................ 259 5.2.3 Principal component analysis..................... 262 5.2.4 Manifold learning............................ 265 5.2.5 Semi-supervised learning........................ 266 5.3 Deep neural networks.............................. 268 5.3.1 Weights and layers........................... 270 5.3.2 Activation functions.......................... 272 5.3.3 Regularization and normalization................... 274 5.3.4 Loss functions............................. 280 5.3.5 Backpropagation............................ 284 5.3.6 Training and optimization....................... 287 5.4 Convolutional neural networks......................... 291 5.4.1 Pooling and unpooling......................... 295 5.4.2 Application: Digit classification.................... 298 xviii Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 5.4.3 Network architectures......................... 299 5.4.4 Model zoos............................... 304 5.4.5 Visualizing weights and activations.................. 307 5.4.6 Adversarial examples.......................... 311 5.4.7 Self-supervised learning........................ 312 5.5 More complex models............................. 317 5.5.1 Three-dimensional CNNs....................... 317 5.5.2 Recurrent neural networks....................... 321 5.5.3 Transformers.............................. 322 5.5.4 Generative models........................... 328 5.6 Additional reading............................... 336 5.7 Exercises.................................... 337 6 Recognition 343 6.1 Instance recognition............................... 346 6.2 Image classification............................... 349 6.2.1 Feature-based methods......................... 350 6.2.2 Deep networks............................. 358 6.2.3 Application: Visual similarity search.................. 360 6.2.4 Face recognition............................ 363 6.3 Object detection................................. 370 6.3.1 Face detection............................. 371 6.3.2 Pedestrian detection.......................... 376 6.3.3 General object detection........................ 379 6.4 Semantic segmentation............................. 387 6.4.1 Application: Medical image segmentation............... 390 6.4.2 Instance segmentation......................... 391 6.4.3 Panoptic segmentation......................... 392 6.4.4 Application: Intelligent photo editing................. 394 6.4.5 Pose estimation............................. 395 6.5 Video understanding.............................. 396 6.6 Vision and language............................... 400 6.7 Additional reading............................... 409 6.8 Exercises.................................... 
413 Contents xix 7 Feature detection and matching 417 7.1 Points and patches................................ 419 7.1.1 Feature detectors............................ 422 7.1.2 Feature descriptors........................... 434 7.1.3 Feature matching............................ 441 7.1.4 Large-scale matching and retrieval................... 448 7.1.5 Feature tracking............................ 452 7.1.6 Application: Performance-driven animation.............. 454 7.2 Edges and contours............................... 455 7.2.1 Edge detection............................. 456 7.2.2 Contour detection............................ 461 7.2.3 Application: Edge editing and enhancement.............. 465 7.3 Contour tracking................................ 466 7.3.1 Snakes and scissors........................... 467 7.3.2 Level Sets................................ 474 7.3.3 Application: Contour tracking and rotoscoping............ 476 7.4 Lines and vanishing points........................... 477 7.4.1 Successive approximation....................... 477 7.4.2 Hough transforms............................ 477 7.4.3 Vanishing points............................ 481 7.5 Segmentation.................................. 483 7.5.1 Graph-based segmentation....................... 486 7.5.2 Mean shift............................... 487 7.5.3 Normalized cuts............................ 489 7.6 Additional reading............................... 491 7.7 Exercises.................................... 495 8 Image alignment and stitching 501 8.1 Pairwise alignment............................... 503 8.1.1 2D alignment using least squares.................... 504 8.1.2 Application: Panography........................ 506 8.1.3 Iterative algorithms........................... 507 8.1.4 Robust least squares and RANSAC.................. 510 8.1.5 3D alignment.............................. 513 8.2 Image stitching................................. 514 8.2.1 Parametric motion models....................... 516 xx Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 8.2.2 Application: Whiteboard and document scanning........... 517 8.2.3 Rotational panoramas.......................... 519 8.2.4 Gap closing............................... 520 8.2.5 Application: Video summarization and compression......... 522 8.2.6 Cylindrical and spherical coordinates................. 523 8.3 Global alignment................................ 526 8.3.1 Bundle adjustment........................... 527 8.3.2 Parallax removal............................ 531 8.3.3 Recognizing panoramas........................ 533 8.4 Compositing................................... 536 8.4.1 Choosing a compositing surface.................... 536 8.4.2 Pixel selection and weighting (deghosting).............. 538 8.4.3 Application: Photomontage...................... 544 8.4.4 Blending................................ 544 8.5 Additional reading............................... 547 8.6 Exercises.................................... 549 9 Motion estimation 555 9.1 Translational alignment............................. 558 9.1.1 Hierarchical motion estimation..................... 562 9.1.2 Fourier-based alignment........................ 563 9.1.3 Incremental refinement......................... 566 9.2 Parametric motion................................ 570 9.2.1 Application: Video stabilization.................... 573 9.2.2 Spline-based motion.......................... 575 9.2.3 Application: Medical image registration................ 
577 9.3 Optical flow................................... 578 9.3.1 Deep learning approaches....................... 584 9.3.2 Application: Rolling shutter wobble removal............. 587 9.3.3 Multi-frame motion estimation..................... 587 9.3.4 Application: Video denoising..................... 589 9.4 Layered motion................................. 589 9.4.1 Application: Frame interpolation.................... 593 9.4.2 Transparent layers and reflections................... 594 9.4.3 Video object segmentation....................... 597 9.4.4 Video object tracking.......................... 598 Contents xxi 9.5 Additional reading............................... 600 9.6 Exercises.................................... 602 10 Computational photography 607 10.1 Photometric calibration............................. 610 10.1.1 Radiometric response function..................... 611 10.1.2 Noise level estimation......................... 614 10.1.3 Vignetting................................ 615 10.1.4 Optical blur (spatial response) estimation............... 616 10.2 High dynamic range imaging.......................... 620 10.2.1 Tone mapping.............................. 627 10.2.2 Application: Flash photography.................... 634 10.3 Super-resolution, denoising, and blur removal................. 637 10.3.1 Color image demosaicing....................... 646 10.3.2 Lens blur (bokeh)............................ 648 10.4 Image matting and compositing......................... 650 10.4.1 Blue screen matting........................... 651 10.4.2 Natural image matting......................... 653 10.4.3 Optimization-based matting...................... 656 10.4.4 Smoke, shadow, and flash matting................... 661 10.4.5 Video matting.............................. 662 10.5 Texture analysis and synthesis......................... 663 10.5.1 Application: Hole filling and inpainting................ 665 10.5.2 Application: Non-photorealistic rendering............... 667 10.5.3 Neural style transfer and semantic image synthesis.......... 669 10.6 Additional reading............................... 671 10.7 Exercises.................................... 674 11 Structure from motion and SLAM 681 11.1 Geometric intrinsic calibration......................... 685 11.1.1 Vanishing points............................ 687 11.1.2 Application: Single view metrology.................. 688 11.1.3 Rotational motion........................... 689 11.1.4 Radial distortion............................ 691 11.2 Pose estimation................................. 693 11.2.1 Linear algorithms............................ 693 11.2.2 Iterative non-linear algorithms..................... 695 xxii Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 11.2.3 Application: Location recognition................... 698 11.2.4 Triangulation.............................. 701 11.3 Two-frame structure from motion........................ 703 11.3.1 Eight, seven, and five-point algorithms................. 703 11.3.2 Special motions and structures..................... 708 11.3.3 Projective (uncalibrated) reconstruction................ 710 11.3.4 Self-calibration............................. 712 11.3.5 Application: View morphing...................... 714 11.4 Multi-frame structure from motion....................... 715 11.4.1 Factorization.............................. 715 11.4.2 Bundle adjustment........................... 717 11.4.3 Exploiting sparsity........................... 
719 11.4.4 Application: Match move....................... 723 11.4.5 Uncertainty and ambiguities...................... 723 11.4.6 Application: Reconstruction from internet photos........... 725 11.4.7 Global structure from motion...................... 728 11.4.8 Constrained structure and motion................... 731 11.5 Simultaneous localization and mapping (SLAM)............... 734 11.5.1 Application: Autonomous navigation................. 737 11.5.2 Application: Smartphone augmented reality.............. 739 11.6 Additional reading............................... 740 11.7 Exercises.................................... 743 12 Depth estimation 749 12.1 Epipolar geometry............................... 753 12.1.1 Rectification.............................. 755 12.1.2 Plane sweep............................... 757 12.2 Sparse correspondence............................. 760 12.2.1 3D curves and profiles......................... 760 12.3 Dense correspondence............................. 762 12.3.1 Similarity measures........................... 764 12.4 Local methods.................................. 766 12.4.1 Sub-pixel estimation and uncertainty.................. 768 12.4.2 Application: Stereo-based head tracking................ 769 12.5 Global optimization............................... 771 12.5.1 Dynamic programming......................... 774 Contents xxiii 12.5.2 Segmentation-based techniques.................... 775 12.5.3 Application: Z-keying and background replacement.......... 777 12.6 Deep neural networks.............................. 778 12.7 Multi-view stereo................................ 781 12.7.1 Scene flow............................... 785 12.7.2 Volumetric and 3D surface reconstruction............... 786 12.7.3 Shape from silhouettes......................... 794 12.8 Monocular depth estimation.......................... 796 12.9 Additional reading............................... 799 12.10Exercises.................................... 800 13 3D reconstruction 805 13.1 Shape from X.................................. 809 13.1.1 Shape from shading and photometric stereo.............. 809 13.1.2 Shape from texture........................... 814 13.1.3 Shape from focus............................ 814 13.2 3D scanning................................... 816 13.2.1 Range data merging.......................... 820 13.2.2 Application: Digital heritage...................... 824 13.3 Surface representations............................. 825 13.3.1 Surface interpolation.......................... 826 13.3.2 Surface simplification......................... 827 13.3.3 Geometry images............................ 828 13.4 Point-based representations........................... 829 13.5 Volumetric representations........................... 830 13.5.1 Implicit surfaces and level sets..................... 831 13.6 Model-based reconstruction........................... 833 13.6.1 Architecture............................... 833 13.6.2 Facial modeling and tracking...................... 838 13.6.3 Application: Facial animation..................... 839 13.6.4 Human body modeling and tracking.................. 843 13.7 Recovering texture maps and albedos..................... 850 13.7.1 Estimating BRDFs........................... 852 13.7.2 Application: 3D model capture..................... 854 13.8 Additional reading............................... 855 13.9 Exercises.................................... 857 xxiv Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 
2021) 14 Image-based rendering 861 14.1 View interpolation................................ 863 14.1.1 View-dependent texture maps..................... 865 14.1.2 Application: Photo Tourism...................... 867 14.2 Layered depth images.............................. 868 14.2.1 Impostors, sprites, and layers...................... 869 14.2.2 Application: 3D photography..................... 872 14.3 Light fields and Lumigraphs.......................... 875 14.3.1 Unstructured Lumigraph........................ 879 14.3.2 Surface light fields........................... 880 14.3.3 Application: Concentric mosaics.................... 882 14.3.4 Application: Synthetic re-focusing................... 883 14.4 Environment mattes............................... 883 14.4.1 Higher-dimensional light fields..................... 885 14.4.2 The modeling to rendering continuum................. 886 14.5 Video-based rendering............................. 887 14.5.1 Video-based animation......................... 888 14.5.2 Video textures............................. 889 14.5.3 Application: Animating pictures.................... 892 14.5.4 3D and free-viewpoint Video...................... 893 14.5.5 Application: Video-based walkthroughs................ 896 14.6 Neural rendering................................ 899 14.7 Additional reading............................... 908 14.8 Exercises.................................... 910 15 Conclusion 915 A Linear algebra and numerical techniques 919 A.1 Matrix decompositions............................. 920 A.1.1 Singular value decomposition..................... 921 A.1.2 Eigenvalue decomposition....................... 922 A.1.3 QR factorization............................ 925 A.1.4 Cholesky factorization......................... 925 A.2 Linear least squares............................... 927 A.2.1 Total least squares........................... 929 A.3 Non-linear least squares............................. 930 A.4 Direct sparse matrix techniques......................... 932 Preface xxv A.4.1 Variable reordering........................... 932 A.5 Iterative techniques............................... 934 A.5.1 Conjugate gradient........................... 934 A.5.2 Preconditioning............................. 936 A.5.3 Multigrid................................ 937 B Bayesian modeling and inference 939 B.1 Estimation theory................................ 941 B.2 Maximum likelihood estimation and least squares............... 943 B.3 Robust statistics................................. 945 B.4 Prior models and Bayesian inference...................... 948 B.5 Markov random fields.............................. 949 B.6 Uncertainty estimation (error analysis)..................... 952 C Supplementary material 953 C.1 Datasets and benchmarks............................ 954 C.2 Software..................................... 961 C.3 Slides and lectures............................... 970 References 973 Index 1179 xxvi Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) Chapter 1 Introduction 1.1 What is computer vision?............................ 3 1.2 A brief history.................................. 10 1.3 Book overview................................. 22 1.4 Sample syllabus................................. 30 1.5 A note on notation............................... 31 1.6 Additional reading............................... 
31 Figure 1.1 The human visual system has no problem interpreting the subtle variations in translucency and shading in this photograph and correctly segmenting the object from its background. 2 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) Figure 1.2 Some examples of computer vision algorithms and applications. (a) Face de- tection algorithms, coupled with color-based clothing and hair detection algorithms, can locate and recognize the individuals in this image (Sivic, Zitnick, and Szeliski 2006) © 2006 Springer. (b) Object instance segmentation can delineate each person and object in a com- plex scene (He, Gkioxari et al. 2017) © 2017 IEEE. (c) Structure from motion algorithms can reconstruct a sparse 3D point model of a large complex scene from hundreds of par- tially overlapping photographs (Snavely, Seitz, and Szeliski 2006) © 2006 ACM. (d) Stereo matching algorithms can build a detailed 3D model of a building façade from hundreds of differently exposed photographs taken from the internet (Goesele, Snavely et al. 2007) © 2007 IEEE. 1.1 What is computer vision? 3 1.1 What is computer vision? As humans, we perceive the three-dimensional structure of the world around us with appar- ent ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers sitting on the table next to you. You can tell the shape and translucency of each petal through the subtle patterns of light and shading that play across its surface and effortlessly segment each flower from the background of the scene (Figure 1.1). Looking at a framed group portrait, you can easily count and name all of the people in the picture and even guess at their emotions from their facial expressions (Figure 1.2a). Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions1 to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains elusive (Marr 1982; Wandell 1995; Palmer 1999; Livingstone 2008; Frisby and Stone 2010). Researchers in computer vision have been developing, in parallel, mathematical tech- niques for recovering the three-dimensional shape and appearance of objects in imagery. Here, the progress in the last two decades has been rapid. We now have reliable techniques for accurately computing a 3D model of an environment from thousands of partially overlapping photographs (Figure 1.2c). Given a large enough set of views of a particular object or façade, we can create accurate dense 3D surface models using stereo matching (Figure 1.2d). We can even, with moderate success, delineate most of the people and objects in a photograph (Fig- ure 1.2a). However, despite all of these advances, the dream of having a computer explain an image at the same level of detail and causality as a two-year old remains elusive. Why is vision so difficult? In part, it is because it is an inverse problem, in which we seek to recover some unknowns given insufficient information to fully specify the solution. We must therefore resort to physics-based and probabilistic models, or machine learning from large sets of examples, to disambiguate between potential solutions. However, modeling the visual world in all of its rich complexity is far more difficult than, say, modeling the vocal tract that produces spoken sounds. 
The forward models that we use in computer vision are usually developed in physics (radiometry, optics, and sensor design) and in computer graphics. Both of these fields model how objects move and animate, how light reflects off their surfaces, is scattered by the atmosphere, refracted through camera lenses (or human eyes), and finally projected onto a flat (or curved) image plane. While computer graphics are not yet perfect, in many domains, such as rendering a still scene composed of everyday objects or animating extinct creatures such as dinosaurs, the illusion of reality is essentially there. In computer vision, we are trying to do the inverse, i.e., to describe the world that we see in one or more images and to reconstruct its properties, such as shape, illumination, and color distributions. It is amazing that humans and animals do this so effortlessly, while computer vision algorithms are so error prone.

1 Some fun pages with striking illusions include https://michaelbach.de/ot, https://www.illusionsindex.org, and http://www.ritsumei.ac.jp/~akitaoka/index-e.html.

Figure 1.3 Some common optical illusions and what they might tell us about the visual system: (a) The classic Müller-Lyer illusion, where the lengths of the two horizontal lines appear different, probably due to the imagined perspective effects. (b) The “white” square B in the shadow and the “black” square A in the light actually have the same absolute intensity value. The percept is due to brightness constancy, the visual system’s attempt to discount illumination when interpreting colors. Image courtesy of Ted Adelson, http://persci.mit.edu/gallery/checkershadow. (c) A variation of the Hermann grid illusion, courtesy of Hany Farid. As you move your eyes over the figure, gray spots appear at the intersections. (d) Count the red Xs in the left half of the figure. Now count them in the right half. Is it significantly harder? The explanation has to do with a pop-out effect (Treisman 1985), which tells us about the operations of parallel perception and integration pathways in the brain.

People who have not worked in the field often underestimate the difficulty of the problem. This misperception that vision should be easy dates back to the early days of artificial intelligence (see Section 1.2), when it was initially believed that the cognitive (logic proving and planning) parts of intelligence were intrinsically more difficult than the perceptual components (Boden 2006).
The good news is that computer vision is being used today in a wide variety of real-world applications, which include: Optical character recognition (OCR): reading handwritten postal codes on letters (Figure 1.4a) and automatic number plate recognition (ANPR); Machine inspection: rapid parts inspection for quality assurance using stereo vision with specialized illumination to measure tolerances on aircraft wings or auto body parts (Figure 1.4b) or looking for defects in steel castings using X-ray vision; Retail: object recognition for automated checkout lanes and fully automated stores (Wingfield 2019); Warehouse logistics: autonomous package delivery and pallet-carrying “drives” (Guizzo 2008; O’Brian 2019) and parts picking by robotic manipulators (Figure 1.4c; Acker- man 2020); Medical imaging: registering pre-operative and intra-operative imagery (Figure 1.4d) or performing long-term studies of people’s brain morphology as they age; Self-driving vehicles: capable of driving point-to-point between cities (Figure 1.4e; Montemerlo, Becker et al. 2008; Urmson, Anhalt et al. 2008; Janai, Güney et al. 2020) as well as autonomous flight (Kaufmann, Gehrig et al. 2019); 3D model building (photogrammetry): fully automated construction of 3D models from aerial and drone photographs (Figure 1.4f); Match move: merging computer-generated imagery (CGI) with live action footage by tracking feature points in the source video to estimate the 3D camera motion and shape of the environment. Such techniques are widely used in Hollywood, e.g., in movies such as Jurassic Park (Roble 1999; Roble and Zafar 2009); they also require the use of 6 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) (a) (b) (c) (d) (e) (f) Figure 1.4 Some industrial applications of computer vision: (a) optical char- acter recognition (OCR), http:// yann.lecun.com/ exdb/ lenet; (b) mechanical inspection, http:// www.cognitens.com; (c) warehouse picking, https:// covariant.ai; (d) medical imaging, http:// www.clarontech.com; (e) self-driving cars, (Montemerlo, Becker et al. 2008) © 2008 Wiley; (f) drone-based photogrammetry, https:// www.pix4d.com/ blog/ mapping-chillon-castle-with-drone. 1.1 What is computer vision? 7 precise matting to insert new elements between foreground and background elements (Chuang, Agarwala et al. 2002). Motion capture (mocap): using retro-reflective markers viewed from multiple cam- eras or other vision-based techniques to capture actors for computer animation; Surveillance: monitoring for intruders, analyzing highway traffic and monitoring pools for drowning victims (e.g., https://swimeye.com); Fingerprint recognition and biometrics: for automatic access authentication as well as forensic applications. David Lowe’s website of industrial vision applications (http://www.cs.ubc.ca/spider/lowe/ vision.html) lists many other interesting industrial applications of computer vision. While the above applications are all extremely important, they mostly pertain to fairly specialized kinds of imagery and narrow domains. In addition to all of these industrial applications, there exist myriad consumer-level ap- plications, such as things you can do with your own personal photographs and video. 
These include: Stitching: turning overlapping photos into a single seamlessly stitched panorama (Fig- ure 1.5a), as described in Section 8.2; Exposure bracketing: merging multiple exposures taken under challenging lighting conditions (strong sunlight and shadows) into a single perfectly exposed image (Fig- ure 1.5b), as described in Section 10.2; Morphing: turning a picture of one of your friends into another, using a seamless morph transition (Figure 1.5c); 3D modeling: converting one or more snapshots into a 3D model of the object or person you are photographing (Figure 1.5d), as described in Section 13.6; Video match move and stabilization: inserting 2D pictures or 3D models into your videos by automatically tracking nearby reference points (see Section 11.4.4)2 or using motion estimates to remove shake from your videos (see Section 9.2.1); Photo-based walkthroughs: navigating a large collection of photographs, such as the interior of your house, by flying between different photos in 3D (see Sections 14.1.2 and 14.5.5); 2 For a fun student project on this topic, see the “PhotoBook” project at http://www.cc.gatech.edu/dvfx/videos/ dvfx2005.html. 8 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) Face detection: for improved camera focusing as well as more relevant image search- ing (see Section 6.3.1); Visual authentication: automatically logging family members onto your home com- puter as they sit down in front of the webcam (see Section 6.2.4). The great thing about these applications is that they are already familiar to most students; they are, at least, technologies that students can immediately appreciate and use with their own personal media. Since computer vision is a challenging topic, given the wide range of mathematics being covered3 and the intrinsically difficult nature of the problems being solved, having fun and relevant problems to work on can be highly motivating and inspiring. The other major reason why this book has a strong focus on applications is that they can be used to formulate and constrain the potentially open-ended problems endemic in vision. Thus, it is better to think back from the problem at hand to suitable techniques, rather than to grab the first technique that you may have heard of. This kind of working back from problems to solutions is typical of an engineering approach to the study of vision and reflects my own background in the field. First, I come up with a detailed problem definition and decide on the constraints and specifications for the problem. Then, I try to find out which techniques are known to work, implement a few of these, evaluate their performance, and finally make a selection. In order for this process to work, it is important to have realistic test data, both synthetic, which can be used to verify correctness and analyze noise sensitivity, and real-world data typical of the way the system will finally be used. If machine learning is being used, it is even more important to have representative unbiased training data in sufficient quantity to obtain good results on real-world inputs. However, this book is not just an engineering text (a source of recipes). It also takes a scientific approach to basic vision problems. Here, I try to come up with the best possible models of the physics of the system at hand: how the scene is created, how light interacts with the scene and atmospheric effects, and how the sensors work, including sources of noise and uncertainty. 
The task is then to try to invert the acquisition process to come up with the best possible description of the scene. The book often uses a statistical approach to formulating and solving computer vision problems. Where appropriate, probability distributions are used to model the scene and the noisy image acquisition process. The association of prior distributions with unknowns is often called Bayesian modeling (Appendix B). It is possible to associate a risk or loss function with 3 These techniques include physics, Euclidean and projective geometry, statistics, and optimization. They make computer vision a fascinating field to study and a great way to learn techniques widely applicable in other fields. 1.1 What is computer vision? 9 (a) (b) (c) (d) Figure 1.5 Some consumer applications of computer vision: (a) image stitching: merging different views (Szeliski and Shum 1997) © 1997 ACM; (b) exposure bracketing: merging different exposures; (c) morphing: blending between two photographs (Gomes, Darsa et al. 1999) © 1999 Morgan Kaufmann; (d) smartphone augmented reality showing real-time depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM. 10 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) misestimating the answer (Section B.2) and to set up your inference algorithm to minimize the expected risk. (Consider a robot trying to estimate the distance to an obstacle: it is usually safer to underestimate than to overestimate.) With statistical techniques, it often helps to gather lots of training data from which to learn probabilistic models. Finally, statistical approaches enable you to use proven inference techniques to estimate the best answer (or distribution of answers) and to quantify the uncertainty in the resulting estimates. Because so much of computer vision involves the solution of inverse problems or the esti- mation of unknown quantities, my book also has a heavy emphasis on algorithms, especially those that are known to work well in practice. For many vision problems, it is all too easy to come up with a mathematical description of the problem that either does not match realistic real-world conditions or does not lend itself to the stable estimation of the unknowns. What we need are algorithms that are both robust to noise and deviation from our models and rea- sonably efficient in terms of run-time resources and space. In this book, I go into these issues in detail, using Bayesian techniques, where applicable, to ensure robustness, and efficient search, minimization, and linear system solving algorithms to ensure efficiency.4 Most of the algorithms described in this book are at a high level, being mostly a list of steps that have to be filled in by students or by reading more detailed descriptions elsewhere. In fact, many of the algorithms are sketched out in the exercises. Now that I’ve described the goals of this book and the frameworks that I use, I devote the rest of this chapter to two additional topics. Section 1.2 is a brief synopsis of the history of computer vision. It can easily be skipped by those who want to get to “the meat” of the new material in this book and do not care as much about who invented what when. The second is an overview of the book’s contents, Section 1.3, which is useful reading for everyone who intends to make a study of this topic (or to jump in partway, since it describes chapter interdependencies). 
1.2 A brief history

In this section, I provide a brief personal synopsis of the main developments in computer vision over the last fifty years (Figure 1.6) with a focus on advances I find personally interesting and that have stood the test of time. Readers not interested in the provenance of various ideas and the evolution of this field should skip ahead to the book overview in Section 1.3.

4 In some cases, deep neural networks have also been shown to be an effective way to speed up algorithms that previously relied on iteration (Chen, Xu, and Koltun 2017).

Figure 1.6 A rough timeline (1970–2020) of some of the most active topics of research in computer vision.

1970s. When computer vision first started out in the early 1970s, it was viewed as the visual perception component of an ambitious agenda to mimic human intelligence and to endow robots with intelligent behavior. At the time, it was believed by some of the early pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) that solving the “visual input” problem would be an easy step along the path to solving more difficult problems such as higher-level reasoning and planning. According to one well-known story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to “spend the summer linking a camera to a computer and getting the computer to describe what it saw” (Boden 2006, p. 781).5 We now know that the problem is slightly more difficult than that.6

What distinguished computer vision from the already existing field of digital image processing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the three-dimensional structure of the world from images and to use this as a stepping stone towards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide two nice collections of classic papers from this early period.

5 Boden (2006) cites (Crevier 1993) as the original source. The actual Vision Memo was authored by Seymour Papert (1966) and involved a whole cohort of students.

6 To see how far robotic vision has come in the last six decades, have a look at some of the videos on the Boston Dynamics https://www.bostondynamics.com, Skydio https://www.skydio.com, and Covariant https://covariant.ai websites.

Early attempts at scene understanding involved extracting edges and then inferring the
3D structure of an object or a “blocks world” from the topological structure of the 2D lines (Roberts 1965). Several line labeling algorithms (Figure 1.7a) were developed at that time (Huffman 1971; Clowes 1971; Waltz 1975; Rosenfeld, Hummel, and Zucker 1976; Kanade 1980). Nalwa (1993) gives a nice review of this area. The topic of edge detection was also an active area of research; a nice survey of contemporaneous work can be found in (Davis 1975).

Figure 1.7 Some early (1970s) examples of computer vision algorithms: (a) line labeling (Nalwa 1993) © 1993 Addison-Wesley, (b) pictorial structures (Fischler and Elschlager 1973) © 1973 IEEE, (c) articulated body model (Marr 1982) © 1982 David Marr, (d) intrinsic images (Barrow and Tenenbaum 1981) © 1973 IEEE, (e) stereo correspondence (Marr 1982) © 1982 David Marr, (f) optical flow (Nagel and Enkelmann 1986) © 1986 IEEE.

Three-dimensional modeling of non-polyhedral objects was also being studied (Baumgart 1974; Baker 1977). One popular approach used generalized cylinders, i.e., solids of revolution and swept closed curves (Agin and Binford 1976; Nevatia and Binford 1977), often arranged into parts relationships7 (Hinton 1977; Marr 1982) (Figure 1.7c). Fischler and Elschlager (1973) called such elastic arrangements of parts pictorial structures (Figure 1.7b).

A qualitative approach to understanding intensities and shading variations and explaining them by the effects of image formation phenomena, such as surface orientation and shadows, was championed by Barrow and Tenenbaum (1981) in their paper on intrinsic images (Figure 1.7d), along with the related 2½-D sketch ideas of Marr (1982). This approach has seen periodic revivals, e.g., in the work of Tappen, Freeman, and Adelson (2005) and Barron and Malik (2012).

7 In robotics and computer animation, these linked-part graphs are often called kinematic chains.

More quantitative approaches to computer vision were also developed at the time, including the first of many feature-based stereo correspondence algorithms (Figure 1.7e) (Dev 1974; Marr and Poggio 1976, 1979; Barnard and Fischler 1982; Ohta and Kanade 1985; Grimson 1985; Pollard, Mayhew, and Frisby 1985) and intensity-based optical flow algorithms (Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 1986). The early work in simultaneously recovering 3D structure and camera motion (see Chapter 11) also began around this time (Ullman 1979; Longuet-Higgins 1981).

A lot of the philosophy of how vision was believed to work at the time is summarized in David Marr’s (1982) book.8 In particular, Marr introduced his notion of the three levels of description of a (visual) information processing system. These three levels, very loosely paraphrased according to my own interpretation, are:

- Computational theory: What is the goal of the computation (task) and what are the constraints that are known or can be brought to bear on the problem?
- Representations and algorithms: How are the input, output, and intermediate information represented and which algorithms are used to calculate the desired result?
- Hardware implementation: How are the representations and algorithms mapped onto actual hardware, e.g., a biological vision system or a specialized piece of silicon? Conversely, how can hardware constraints be used to guide the choice of representation and algorithm? With the prevalent use of graphics chips (GPUs) and many-core architectures for computer vision, this question is again quite relevant.
As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the problem specification and known constraints from image formation and priors (the scientific and statistical approaches) must be married with efficient and robust algorithms (the engineering approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy is as good a guide to framing and solving problems in our field today as it was 25 years ago.

8 More recent developments in visual perception theory are covered in (Wandell 1995; Palmer 1999; Livingstone 2008; Frisby and Stone 2010).

1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical techniques for performing quantitative image and scene analysis.

Image pyramids (see Section 3.5) started being widely used to perform tasks such as image blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt and Adelson 1983b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions of pyramids using the concept of scale-space processing were also developed (Witkin 1983; Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see Section 3.5.4) started displacing or augmenting regular image pyramids in some applications (Mallat 1989; Simoncelli and Adelson 1990a; Simoncelli, Freeman et al. 1992).

Figure 1.8 Examples of computer vision algorithms from the 1980s: (a) pyramid blending (Burt and Adelson 1983b) © 1983 ACM, (b) shape from shading (Freeman and Adelson 1991) © 1991 IEEE, (c) edge detection (Freeman and Adelson 1991) © 1991 IEEE, (d) physically based models (Terzopoulos and Witkin 1988) © 1988 IEEE, (e) regularization-based surface reconstruction (Terzopoulos 1988) © 1988 IEEE, (f) range data acquisition and merging (Banno, Masuda et al. 2008) © 2008 Springer.

The use of stereo as a quantitative shape cue was extended by a wide variety of shape-from-X techniques, including shape from shading (Figure 1.8b) (see Section 13.1.1 and Horn 1975; Pentland 1984; Blake, Zisserman, and Knowles 1985; Horn and Brooks 1986, 1989), photometric stereo (see Section 13.1.1 and Woodham 1981), shape from texture (see Section 13.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from focus (see Section 13.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice discussion of most of these techniques.

Research into better edge and contour detection (Figure 1.8c) (see Section 7.2) was also active during this period (Canny 1986; Nalwa and Binford 1986), including the introduction of dynamically evolving contour trackers (Section 7.3.1) such as snakes (Kass, Witkin, and Terzopoulos 1988), as well as three-dimensional physically based models (Figure 1.8d) (Terzopoulos, Witkin, and Kass 1987; Kass, Witkin, and Terzopoulos 1988; Terzopoulos and Fleischer 1988).
Researchers noticed that a lot of the stereo, flow, shape-from-X, and edge detection algorithms could be unified, or at least described, using the same mathematical framework if they were posed as variational optimization problems and made more robust (well-posed) using regularization (Figure 1.8e) (see Section 4.2 and Terzopoulos 1983; Poggio, Torre, and Koch 1985; Terzopoulos 1986b; Blake and Zisserman 1987; Bertero, Poggio, and Torre 1988; Terzopoulos 1988). Around the same time, Geman and Geman (1984) pointed out that such problems could equally well be formulated using discrete Markov random field (MRF) models (see Section 4.3), which enabled the use of better (global) search and optimization algorithms, such as simulated annealing.

Online variants of MRF algorithms that modeled and updated uncertainties using the Kalman filter were introduced a little later (Dickmanns and Graefe 1988; Matthies, Kanade, and Szeliski 1989; Szeliski 1989). Attempts were also made to map both regularized and MRF algorithms onto parallel hardware (Poggio and Koch 1985; Poggio, Little et al. 1988; Fischler, Firschein et al. 1989). The book by Fischler and Firschein (1987) contains a nice collection of articles focusing on all of these topics (stereo, flow, regularization, MRFs, and even higher-level vision).

Three-dimensional range data processing (acquisition, merging, modeling, and recognition; see Figure 1.8f) continued being actively explored during this decade (Agin and Binford 1976; Besl and Jain 1985; Faugeras and Hebert 1987; Curless and Levoy 1996). The compilation by Kanade (1987) contains a lot of the interesting papers in this area.

1990s. While a lot of the previously mentioned topics continued to be explored, a few of them became significantly more active.

A burst of activity in using projective invariants for recognition (Mundy and Zisserman 1992) evolved into a concerted effort to solve the structure from motion problem (see Chapter 11). A lot of the initial activity was directed at projective reconstructions, which did not require knowledge of camera calibration (Faugeras 1992; Hartley, Gupta, and Chang 1992; Hartley 1994a; Faugeras and Luong 2001; Hartley and Zisserman 2004). Simultaneously, factorization techniques (Section 11.4.1) were developed to solve efficiently problems for which orthographic camera approximations were applicable (Figure 1.9a) (Tomasi and Kanade 1992; Poelman and Kanade 1997; Anandan and Irani 2002) and then later extended to the perspective case (Christy and Horaud 1996; Triggs 1996). Eventually, the field started using full global optimization (see Section 11.4.2 and Taylor, Kriegman, and Anandan 1991; Szeliski and Kang 1994; Azarbayejani and Pentland 1995), which was later recognized as being the same as the bundle adjustment techniques traditionally used in photogrammetry (Triggs, McLauchlan et al. 1999).

Figure 1.9 Examples of computer vision algorithms from the 1990s: (a) factorization-based structure from motion (Tomasi and Kanade 1992) © 1992 Springer, (b) dense stereo matching (Boykov, Veksler, and Zabih 2001), (c) multi-view reconstruction (Seitz and Dyer 1999) © 1999 Springer, (d) face tracking (Matthews, Xiao, and Baker 2007), (e) image segmentation (Belongie, Fowlkes et al. 2002) © 2002 Springer, (f) face recognition (Turk and Pentland 1991).
Fully automated 3D modeling systems were built using such techniques (Beardsley, Torr, and Zisserman 1996; Schaffalitzky and Zisserman 2002; Snavely, Seitz, and Szeliski 2006; Agarwal, Furukawa et al. 2011; Frahm, Fite-Georgel et al. 2010).

Work begun in the 1980s on using detailed measurements of color and intensity combined with accurate physical models of radiance transport and color image formation created its own subfield known as physics-based vision. A good survey of the field can be found in the three-volume collection on this topic (Wolff, Shafer, and Healey 1992a; Healey and Shafer 1992; Shafer, Healey, and Wolff 1992).

Optical flow methods (see Chapter 9) continued to be improved (Nagel and Enkelmann 1986; Bolles, Baker, and Marimont 1987; Horn and Weldon Jr. 1988; Anandan 1989; Bergen, Anandan et al. 1992; Black and Anandan 1996; Bruhn, Weickert, and Schnörr 2005; Papenberg, Bruhn et al. 2006), with (Nagel 1986; Barron, Fleet, and Beauchemin 1994; Baker, Scharstein et al. 2011) being good surveys. Similarly, a lot of progress was made on dense stereo correspondence algorithms (see Chapter 12, Okutomi and Kanade (1993, 1994); Boykov, Veksler, and Zabih (1998); Birchfield and Tomasi (1999); Boykov, Veksler, and Zabih (2001), and the survey and comparison in Scharstein and Szeliski (2002)), with the biggest breakthrough being perhaps global optimization using graph cut techniques (Figure 1.9b) (Boykov, Veksler, and Zabih 2001).

Multi-view stereo algorithms (Figure 1.9c) that produce complete 3D surfaces (see Section 12.7) were also an active topic of research (Seitz and Dyer 1999; Kutulakos and Seitz 2000) that continues to be active today (Seitz, Curless et al. 2006; Schöps, Schönberger et al. 2017; Knapitsch, Park et al. 2017). Techniques for producing 3D volumetric descriptions from binary silhouettes (see Section 12.7.3) continued to be developed (Potmesil 1987; Srivasan, Liang, and Hackwood 1990; Szeliski 1993; Laurentini 1994), along with techniques based on tracking and reconstructing smooth occluding contours (see Section 12.2.1 and Cipolla and Blake 1992; Vaillant and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; Szeliski and Weiss 1998; Cipolla and Giblin 2000).

Tracking algorithms also improved a lot, including contour tracking using active contours (see Section 7.3), such as snakes (Kass, Witkin, and Terzopoulos 1988), particle filters (Blake and Isard 1998), and level sets (Malladi, Sethian, and Vemuri 1995), as well as intensity-based (direct) techniques (Lucas and Kanade 1981; Shi and Tomasi 1994; Rehg and Kanade 1994), often applied to tracking faces (Figure 1.9d) (Lanitis, Taylor, and Cootes 1997; Matthews and Baker 2004; Matthews, Xiao, and Baker 2007) and whole bodies (Sidenbladh, Black, and Fleet 2000; Hilton, Fua, and Ronfard 2006; Moeslund, Hilton, and Krüger 2006).

Image segmentation (see Section 7.5) (Figure 1.9e), a topic which has been active since the earliest days of computer vision (Brice and Fennema 1970; Horowitz and Pavlidis 1976; Riseman and Arbib 1977; Rosenfeld and Davis 1979; Haralick and Shapiro 1985; Pavlidis and Liow 1990), was also an active topic of research, producing techniques based on minimum energy (Mumford and Shah 1989) and minimum description length (Leclerc 1989), normalized cuts (Shi and Malik 2000), and mean shift (Comaniciu and Meer 2002).
Statistical learning techniques started appearing, first in the application of principal component eigenface analysis to face recognition (Figure 1.9f) (see Section 5.2.3 and Turk and Pentland 1991) and linear dynamical systems for curve tracking (see Section 7.3.1 and Blake and Isard 1998).

Perhaps the most notable development in computer vision during this decade was the increased interaction with computer graphics (Seitz and Szeliski 1999), especially in the cross-disciplinary area of image-based modeling and rendering (see Chapter 14). The idea of manipulating real-world imagery directly to create new animations first came to prominence with image morphing techniques (Figure 1.5c) (see Section 3.6.3 and Beier and Neely 1992) and was later applied to view interpolation (Chen and Williams 1993; Seitz and Dyer 1996), panoramic image stitching (Figure 1.5a) (see Section 8.2 and Mann and Picard 1994; Chen 1995; Szeliski 1996; Szeliski and Shum 1997; Szeliski 2006a), and full light-field rendering (Figure 1.10a) (see Section 14.3 and Gortler, Grzeszczuk et al. 1996; Levoy and Hanrahan 1996; Shade, Gortler et al. 1998). At the same time, image-based modeling techniques (Figure 1.10b) for automatically creating realistic 3D models from collections of images were also being introduced (Beardsley, Torr, and Zisserman 1996; Debevec, Taylor, and Malik 1996; Taylor, Debevec, and Malik 1996).

Figure 1.10 Examples of computer vision algorithms from the 2000s: (a) image-based rendering (Gortler, Grzeszczuk et al. 1996), (b) image-based modeling (Debevec, Taylor, and Malik 1996) © 1996 ACM, (c) interactive tone mapping (Lischinski, Farbman et al. 2006), (d) texture synthesis (Efros and Freeman 2001), (e) feature-based recognition (Fergus, Perona, and Zisserman 2007), (f) region-based recognition (Mori, Ren et al. 2004) © 2004 IEEE.

2000s. This decade continued to deepen the interplay between the vision and graphics fields, but more importantly embraced data-driven and learning approaches as core components of vision.

Many of the topics introduced under the rubric of image-based rendering, such as image stitching (see Section 8.2), light-field capture and rendering (see Section 14.3), and high dynamic range (HDR) image capture through exposure bracketing (Figure 1.5b) (see Section 10.2 and Mann and Picard 1995; Debevec and Malik 1997), were re-christened as computational photography (see Chapter 10) to acknowledge the increased use of such techniques in everyday digital photography. For example, the rapid adoption of exposure bracketing to create high dynamic range images necessitated the development of tone mapping algorithms (Figure 1.10c) (see Section 10.2.1) to convert such images back to displayable results (Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark et al. 2002; Lischinski, Farbman et al. 2006). In addition to merging multiple exposures, techniques were developed to merge flash images with non-flash counterparts (Eisemann and Durand 2004; Petschnigg, Agrawala et al. 2004) and to interactively or automatically select different regions from overlapping images (Agarwala, Dontcheva et al. 2004).

Texture synthesis (Figure 1.10d) (see Section 10.5), quilting (Efros and Leung 1999; Efros and Freeman 2001; Kwatra, Schödl et al. 2003), and inpainting (Bertalmio, Sapiro et al. 2000; Bertalmio, Vese et al.
2003; Criminisi, Pérez, and Toyama 2004) are additional topics that can be classified as computational photography techniques, since they re-combine input image samples to produce new photographs.

A second notable trend during this decade was the emergence of feature-based techniques (combined with learning) for object recognition (see Section 6.1 and Ponce, Hebert et al. 2006). Some of the notable papers in this area include the constellation model of Fergus, Perona, and Zisserman (2007) (Figure 1.10e) and the pictorial structures of Felzenszwalb and Huttenlocher (2005). Feature-based techniques also dominate other recognition tasks, such as scene recognition (Zhang, Marszalek et al. 2007) and panorama and location recognition (Brown and Lowe 2007; Schindler, Brown, and Szeliski 2007). And while interest point (patch-based) features tend to dominate current research, some groups are pursuing recognition based on contours (Belongie, Malik, and Puzicha 2002) and region segmentation (Figure 1.10f) (Mori, Ren et al. 2004).

Another significant trend from this decade was the development of more efficient algorithms for complex global optimization problems (see Chapter 4 and Appendix B.5 and Szeliski, Zabih et al. 2008; Blake, Kohli, and Rother 2011). While this trend began with work on graph cuts (Boykov, Veksler, and Zabih 2001; Kohli and Torr 2007), a lot of progress has also been made in message passing algorithms, such as loopy belief propagation (LBP) (Yedidia, Freeman, and Weiss 2001; Kumar and Torr 2006).

The most notable trend from this decade, which has by now completely taken over visual recognition and most other aspects of computer vision, was the application of sophisticated machine learning techniques to computer vision problems (see Chapters 5 and 6). This trend coincided with the increased availability of immense quantities of partially labeled data on the internet, as well as significant increases in computational power, which makes it more feasible to learn object categories without the use of careful human supervision.

Figure 1.11 Examples of computer vision algorithms from the 2010s: (a) the SuperVision deep neural network © Krizhevsky, Sutskever, and Hinton (2012); (b) object instance segmentation (He, Gkioxari et al. 2017) © 2017 IEEE; (c) whole body, expression, and gesture fitting from a single image (Pavlakos, Choutas et al. 2019) © 2019 IEEE; (d) fusing multiple color depth images using the KinectFusion real-time system (Newcombe, Izadi et al. 2011) © 2011 IEEE; (e) smartphone augmented reality with real-time depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM; (f) 3D map computed in real-time on a fully autonomous Skydio R1 drone (Cross 2019).

2010s. The trend towards using large labeled (and also self-supervised) datasets to develop machine learning algorithms became a tidal wave that totally revolutionized the development of image recognition algorithms as well as other applications, such as denoising and optical flow, which previously used Bayesian and global optimization techniques. This trend was enabled by the development of high-quality large-scale annotated datasets such as ImageNet (Deng, Dong et al. 2009; Russakovsky, Deng et al. 2015), Microsoft COCO (Common Objects in Context) (Lin, Maire et al. 2014), and LVIS (Gupta, Dollár, and Girshick 2019).
These datasets provided not only reliable metrics for tracking the progress of recognition and semantic segmentation algorithms, but more importantly, sufficient labeled data to develop complete solutions based on machine learning.

Another major trend was the dramatic increase in computational power available from the development of general purpose (data-parallel) algorithms on graphical processing units (GPGPU). The breakthrough SuperVision (“AlexNet”) deep neural network (Figure 1.11a; Krizhevsky, Sutskever, and Hinton 2012), which was the first neural network to win the yearly ImageNet large-scale visual recognition challenge, relied on GPU training, as well as a number of technical advances, for its dramatic performance. After the publication of this paper, progress in using deep convolutional architectures accelerated dramatically, to the point where they are now the only architecture considered for recognition and semantic segmentation tasks (Figure 1.11b), as well as the preferred architecture for many other vision tasks (Chapter 5; LeCun, Bengio, and Hinton 2015), including optical flow (Sun, Yang et al. 2018), denoising, and monocular depth inference (Li, Dekel et al. 2019).

Large datasets and GPU architectures, coupled with the rapid dissemination of ideas through timely publications on arXiv as well as the development of languages for deep learning and the open sourcing of neural network models, all contributed to an explosive growth in this area, both in rapid advances and capabilities, and also in the sheer number of publications and researchers now working on these topics. They also enabled the extension of image recognition approaches to video understanding tasks such as action recognition (Feichtenhofer, Fan et al. 2019), as well as structured regression tasks such as real-time multi-person body pose estimation (Cao, Simon et al. 2017).

Specialized sensors and hardware for computer vision tasks also continued to advance. The Microsoft Kinect depth camera, released in 2010, quickly became an essential component of many 3D modeling (Figure 1.11d) and person tracking (Shotton, Fitzgibbon et al. 2011) systems. Over the decade, 3D body shape modeling and tracking systems continued to evolve, to the point where it is now possible to infer a person’s 3D model with gestures and expression from a single image (Figure 1.11c).

And while depth sensors have not yet become ubiquitous (except for security applications on high-end phones), computational photography algorithms run on all of today’s smartphones. Innovations introduced in the computer vision community, such as panoramic image stitching and bracketed high dynamic range image merging, are now standard features, and multi-image low-light denoising algorithms are also becoming commonplace (Liba, Murthy et al. 2019). Lightfield imaging algorithms, which allow the creation of soft depth-of-field effects, are now also becoming more available (Garg, Wadhwa et al. 2019). Finally, mobile augmented reality applications that perform real-time pose estimation and environment augmentation using combinations of feature tracking and inertial measurements are commonplace, and are currently being extended to include pixel-accurate depth occlusion effects (Figure 1.11e).
On higher-end platforms such as autonomous vehicles and drones, powerful real-time SLAM (simultaneous localization and mapping) and VIO (visual inertial odometry) algorithms (Engel, Schöps, and Cremers 2014; Forster, Zhang et al. 2017; Engel, Koltun, and Cremers 2018) can build accurate 3D maps that enable, e.g., autonomous flight through challenging scenes such as forests (Figure 1.11f).

In summary, this past decade has seen incredible advances in the performance and reliability of computer vision algorithms, brought in part by the shift to machine learning and training on very large sets of real-world data. It has also seen the application of vision algorithms in myriad commercial and consumer scenarios as well as new challenges engendered by their widespread use (Su and Crandall 2021).

1.3 Book overview

In the final part of this introduction, I give a brief tour of the material in this book, as well as a few notes on notation and some additional general references. Since computer vision is su