Introduction to Computer Vision Fall 2024 Lecture Notes PDF
Document Details
Toronto Metropolitan University
2024
Omar Falou
Summary
These lecture notes cover an introduction to computer vision, focusing on local features, Harris corner detection, and applications such as panorama stitching, for the Fall 2024 term at Toronto Metropolitan University.
Full Transcript
CPS834/CPS8307 Introduction to Computer Vision. Dr. Omar Falou, Toronto Metropolitan University, Fall 2024.

Local features & Harris corner detection
Reading: Szeliski 7.1, Feature extraction – Corners and blobs

Motivation: Automatic panoramas (credit: Matt Brown)
Panorama stitching: panorama captured by the Perseverance Rover, Feb. 20, 2021. https://www.space.com/nasa-perseverance-rover-first-panorama-mars
GigaPan: http://gigapan.com/ Also see Google Zoom Views: https://www.google.com/culturalinstitute/beta/project/gigapixels

Why extract features? Motivation: panorama stitching. We have two images; how do we combine them? Step 1: extract features. Step 2: match features. Step 3: align images.

Application: Visual SLAM (Simultaneous Localization and Mapping). VSLAM refers to the process of calculating the position and orientation of a camera with respect to its surroundings while simultaneously mapping the environment. The process uses only visual inputs from the camera.

Do these images overlap? Answer below (look for tiny colored squares…). NASA Mars Rover images with SIFT feature matches. Feature matching for object search. Feature matching.

Invariant local features: find features that are invariant to transformations.
– Geometric invariance: translation, rotation, scale
– Photometric invariance: brightness, exposure, …

Feature descriptors. Advantages of local features:
– Locality: features are local, so they are robust to occlusion and clutter
– Quantity: hundreds or thousands in a single image
– Distinctiveness: can differentiate a large database of objects
– Efficiency: real-time performance achievable

More motivation… Feature points are used for image alignment (e.g., mosaics), 3D reconstruction, motion tracking (e.g., for AR), object recognition, image retrieval, robot/car navigation, and more.

Local features: main components
1) Detection: identify the interest points.
2) Description: extract a vector feature descriptor x1 = [x1^(1), …, xd^(1)] surrounding each interest point.
3) Matching: determine correspondence between descriptors x2 = [x1^(2), …, xd^(2)] in two views.
(Credit: Kristen Grauman)

What makes a good feature? (Snoop demo) We want uniqueness: look for image regions that are unusual, since they lead to unambiguous matches in other images. How do we define "unusual"?

Local measures of uniqueness: suppose we only consider a small window of pixels. What defines whether a feature is a good or bad candidate? Ask how the window changes when you shift it; a good feature is one where shifting the window in any direction causes a big change. "Flat" region: no change in any direction. "Edge": no change along the edge direction. "Corner": significant change in all directions. (Credit: S. Seitz, D. Frolova, D. Simakov)

Harris corner detection: the math. Consider shifting the window W by (u,v): how do the pixels in W change? Compare each pixel before and after by summing up the squared differences (SSD). This defines an SSD "error" E(u,v) = sum over (x,y) in W of [I(x+u, y+v) − I(x,y)]^2. We are happy if this error is high, and very happy if it is high for all offsets (u,v). Computing it exactly for each pixel and each offset (u,v) is slow. Chris Harris and Mike Stephens (1988). "A Combined Corner and Edge Detector". Alvey Vision Conference.
F value. The f value is often referred to as the Harris response (R), and it indicates whether a pixel is classified as a corner (high positive R), an edge (negative R), or a flat region (small or zero R). The higher the f (R) value, the stronger the corner feature detected at that point.

F value vs. SSD. The f value (R) is used specifically for corner detection, to classify image pixels as corners, edges, or flat regions. SSD is used for comparing and matching patches from different parts of an image or between two images.

Harris detector example: compute the f value (red high, blue low), threshold it (f > value), find local maxima of f, giving the Harris features (shown in red).

Harris corners – why so complicated? Can't we just check for regions with lots of gradients in the x and y directions? No: a diagonal line would satisfy that criterion.

Feature invariance
Reading: Szeliski (2nd edition) 7.1
Panorama stitching: panorama captured by the Perseverance Rover, Feb. 20, 2021. https://www.space.com/nasa-perseverance-rover-first-panorama-mars

Local features: main components (as before): 1) detection: identify the interest points; 2) description: extract a vector feature descriptor x1 = [x1^(1), …, xd^(1)] surrounding each interest point; 3) matching: determine correspondence between descriptors x2 = [x1^(2), …, xd^(2)] in two views. (Kristen Grauman) Harris features (in red).

Image transformations: geometric (rotation, scale) and photometric (intensity change).

Invariance and equivariance. We want corner locations to be invariant to photometric transformations and equivariant to geometric transformations.
– Invariance: the image is transformed and the corner locations do not change.
– Equivariance: if we have two transformed versions of the same image, features should be detected in corresponding locations.
– (Sometimes "invariant" and "equivariant" are both referred to as "invariant"; sometimes "equivariant" is called "covariant".)

Harris detector invariance properties, image translation: derivatives and the window function are equivariant, so the corner location is equivariant w.r.t. translation.

Harris detector invariance properties, image rotation: the second moment ellipse rotates but its shape remains the same, so the corner location is equivariant w.r.t. image rotation.

Harris detector invariance properties, affine intensity change I → aI + b: only derivatives are used, so the detector is invariant to an intensity shift I → I + b; under intensity scaling I → aI the response R is rescaled, so a fixed threshold changes which points are detected. The detector is only partially invariant to affine intensity change.

Harris detector invariance properties, scaling: a corner viewed at a finer scale can have all of its points classified as edges. The detector is neither invariant nor equivariant to scaling.

Scale invariant detection. Suppose you're looking for corners. Key idea: find the scale that gives a local maximum of f, in both position and scale. One definition of f: the Harris operator. Automatic scale selection: Lindeberg et al., 1996.

Feature extraction: corners and blobs. Blob detection, basic idea: to detect blobs, convolve the image with a "blob filter" at multiple scales and look for extrema of the filter response in the resulting scale space. Another common definition of f is the Laplacian of Gaussian (LoG), ∇²g = ∂²g/∂x² + ∂²g/∂y² (very similar to a Difference of Gaussians (DoG), i.e. a Gaussian minus a slightly smaller Gaussian).
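As an illustration of the blob-filter idea, here is a minimal sketch that stacks Laplacian-of-Gaussian responses over several scales, assuming SciPy's gaussian_laplace; the σ² factor is the scale normalization discussed just below, and the choice of scales and of scipy.ndimage are assumptions for the example, not part of the notes.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_scale_space(gray, sigmas):
    """Scale-normalized Laplacian-of-Gaussian responses at several scales.

    Multiplying by sigma**2 keeps the responses comparable across scales,
    so blobs can be found as extrema in (x, y, scale).
    """
    responses = []
    for sigma in sigmas:
        # gaussian_laplace computes the LoG; sigma**2 applies scale normalization
        responses.append((sigma ** 2) * gaussian_laplace(gray.astype(np.float64), sigma))
    return np.stack(responses, axis=0)  # shape: (num_scales, H, W)

# Example usage (hypothetical scale range):
# sigmas = np.linspace(1.0, 10.0, 10)
# cube = log_scale_space(gray, sigmas) ** 2
# best_scale = sigmas[np.argmax(cube, axis=0)]  # per-pixel characteristic scale
```

Blobs and their characteristic scales would then be found as extrema of this response stack in both position and scale.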
Laplacian of Gaussian "blob" detector: find the maxima and minima of the LoG operator response in both space and scale.

Recall: edge detection. Edge detection, take 2. From edge to blob.

Scale selection: we want to find the characteristic scale of a blob by convolving it with Laplacians at several scales and looking for the maximum response. However, the Laplacian response decays as the scale increases.

Scale normalization: the response of a derivative-of-Gaussian filter to a perfect step edge decreases as σ increases. To keep the response the same (scale-invariant), the Gaussian derivative must be multiplied by σ. The Laplacian is the second Gaussian derivative, so it must be multiplied by σ².

Effect of scale normalization. Blob detection in 2D. Scale selection: at what scale does the Laplacian achieve a maximum response for a binary circle of radius r?

Characteristic scale: we define the characteristic scale as the scale that produces the peak of the Laplacian response. T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30(2): 77–116.

Scale-space blob detector: 1. Convolve the image with the scale-normalized Laplacian at several scales. 2. Find maxima of the squared Laplacian response in scale space. Scale-space blob detector: examples.

Feature descriptors. We know how to detect good points; the next question is how to match them. Answer: come up with a descriptor for each point, then find similar descriptors between the two images.

Feature descriptors and feature matching
Reading: Szeliski (2nd edition) 7.1
Local features: main components (as before): 1) detection, 2) description, 3) matching. (Kristen Grauman)

Feature descriptors: we know how to detect good points; next question: how to match them? Lots of possibilities:
– Simple option: match square windows around the point (see the sketch after this part)
– State-of-the-art approach: SIFT. David Lowe, UBC, http://www.cs.ubc.ca/~lowe/keypoints/

Invariance vs. discriminability.
– Invariance: the descriptor shouldn't change even if the image is transformed.
– Discriminability: the descriptor should be highly unique for each point.

Image transformations revisited: geometric (rotation, scale) and photometric (intensity change).

Invariant descriptors. We looked at invariant / equivariant detectors. Most feature descriptors are also designed to be invariant to translation, 2D rotation, and scale. They can usually also handle limited 3D rotations (SIFT works up to about 60 degrees), limited affine transforms (some are fully affine invariant), and limited illumination/contrast changes.
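As referenced above, here is a minimal sketch of the simple "match square windows around the point" option, assuming NumPy; the window size, the SSD matching criterion, and the assumption that keypoints lie away from the image border are illustrative choices, not details from the notes.

```python
import numpy as np

def patch_descriptors(gray, keypoints, half=2):
    """Simple descriptor: a square, axis-aligned (2*half+1)x(2*half+1) window
    of raw pixel values around each (row, col) keypoint, flattened to a vector.
    Assumes every keypoint is at least `half` pixels from the image border."""
    descs = []
    for (y, x) in keypoints:
        patch = gray[y - half:y + half + 1, x - half:x + half + 1]
        descs.append(patch.astype(np.float64).ravel())
    return np.array(descs)

def match_ssd(desc1, desc2):
    """For each descriptor in desc1, return the index of the desc2 descriptor
    with the smallest sum of squared differences (SSD)."""
    diffs = desc1[:, None, :] - desc2[None, :, :]   # pairwise differences
    ssd = np.sum(diffs ** 2, axis=2)                # (n1, n2) SSD matrix
    return np.argmin(ssd, axis=1)
```

Such a raw-pixel window is neither rotation- nor scale-invariant, which is exactly why the notes move on to better descriptors such as SIFT.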
How to achieve invariance: we need both of the following. 1. Make sure your detector is invariant. 2. Design an invariant feature descriptor.
– Simplest descriptor: a single 0
– Next simplest descriptor: a square, axis-aligned 5x5 window of pixels
– Let's look at some better approaches…

Scale Invariant Feature Transform. Basic idea: take a 16x16 square window around the detected feature; compute the edge orientation (angle of the gradient minus 90°) for each pixel; throw out weak edges (threshold on gradient magnitude); create a histogram of the surviving edge orientations (an angle histogram over 0 to 2π); shift the bins so that the biggest one is first. (Adapted from a slide by David Lowe)

SIFT descriptor, full version: divide the 16x16 window into a 4x4 grid of cells (a 2x2 case is shown on the slide) and compute an orientation histogram for each cell; 16 cells × 8 orientations = a 128-dimensional descriptor. Histogram of gradients => descriptor vector. (Adapted from a slide by David Lowe)

Properties of SIFT: an extraordinarily robust matching technique. It can handle changes in viewpoint (up to about 60 degrees of out-of-plane rotation), can handle significant changes in illumination (sometimes even day vs. night), and is pretty fast: hard to make real-time, but can run in…
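To tie the pieces together, here is a hedged end-to-end sketch of SIFT detection and matching with OpenCV, assuming a recent opencv-python build that exposes cv2.SIFT_create; the image file names are placeholders, and the 0.75 ratio test is Lowe's usual matching heuristic rather than something stated in these notes.

```python
import cv2

# Load the two images to be matched (placeholder file names)
img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 128-dimensional SIFT descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors with a brute-force matcher and Lowe's ratio test
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

print(f"{len(good)} putative matches after the ratio test")
```

The resulting correspondences are the kind of matches used for panorama stitching and the other applications listed at the start of these notes.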