
SLAM_2024.pdf


Document Details


Uploaded by ToughestForgetMeNot

University of Twente

Tags: robotics, artificial intelligence, computer vision

Full Transcript


AI FOR AUTONOMOUS ROBOTS — SfM, SLAM AND VO
Dr. Francesco Nex

Overview
- Pinhole camera model
- Single image orientation
- Stereo-pair orientation
- Feature matching
- Structure from Motion
- Visual Odometry
- SLAM methods
- Deep learning in SLAM

Intrinsic Parameters
The pinhole camera model describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane. The transformation from image coordinates to pixel coordinates is p = Kx:

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

with focal length f and principal point (u_0, v_0), assuming the same pixel size in the x and y directions. K is the intrinsic parameter matrix.

Extrinsic Parameters
Extrinsic parameters define the position of the camera centre and the camera orientation in world coordinates. What does it remind you?

World 3D coordinates: [X_w, Y_w, Z_w]^T; camera 3D coordinates: [X_c, Y_c, Z_c]^T. Rotation between both systems:

R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \qquad T = \begin{bmatrix} t_X \\ t_Y \\ t_Z \end{bmatrix} \; \text{(translation between the systems)}

World 3D to camera 2D coordinates
Camera motion (rotation R, translation T) causes a change of the pixel position (x, y):

\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \left( \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} - T \right)

and in homogeneous coordinates:

\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & -RT \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}

Projection matrix P
x = K_int K_ext X, where the extrinsic parameters are rotation + translation:

x = K_{int} R [I \mid -T] X, \qquad P = K_{int} R [I \mid -T], \qquad x = P X

P is the projection (or camera) matrix, a 3x4 matrix with 11 degrees of freedom: 5 from K, 3 from the rotation and 3 from the translation.

\begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}

2D point (3x1) = pixel-coordinate transformation matrix (3x3) · rotation matrix (3x3) · world-to-camera transformation matrix (3x4) · world 3D point (4x1).

Direct Linear Transformation (DLT) – single image orientation
With x_{ij} = u/w and y_{ij} = v/w, and P^i denoting row i of P:

x_{ij} = \frac{P^1 \cdot X}{P^3 \cdot X} \;\Rightarrow\; (P^1 - x_{ij} P^3) \cdot X = 0, \qquad y_{ij} = \frac{P^2 \cdot X}{P^3 \cdot X} \;\Rightarrow\; (P^2 - y_{ij} P^3) \cdot X = 0

In matrix form:

\begin{bmatrix} X^T & 0^T & -x_{ij} X^T \\ 0^T & X^T & -y_{ij} X^T \end{bmatrix} \begin{bmatrix} P^{1T} \\ P^{2T} \\ P^{3T} \end{bmatrix} = 0 \qquad \text{(at least 6 points are needed!)}

Expanding the elements for one image point:

\begin{bmatrix} X & Y & Z & 1 & 0 & 0 & 0 & 0 & -x_{ij}X & -x_{ij}Y & -x_{ij}Z & -x_{ij} \\ 0 & 0 & 0 & 0 & X & Y & Z & 1 & -y_{ij}X & -y_{ij}Y & -y_{ij}Z & -y_{ij} \end{bmatrix} \begin{bmatrix} p_{11} \\ \vdots \\ p_{34} \end{bmatrix} = 0

The equation system is solved by SVD.
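As an illustration of the DLT step above (not part of the original slides), here is a minimal Python sketch, assuming NumPy; the array names `points_3d` and `points_2d` are placeholders. It stacks the two rows per correspondence shown above and takes the smallest right singular vector as P.

```python
import numpy as np

def dlt_camera_matrix(points_3d, points_2d):
    """Estimate the 3x4 projection matrix P from >= 6 3D-2D correspondences (DLT)."""
    assert len(points_3d) >= 6, "DLT needs at least 6 points"
    rows = []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        Xh = np.array([X, Y, Z, 1.0])
        rows.append(np.hstack([Xh, np.zeros(4), -x * Xh]))  # (P1 - x P3) . X = 0
        rows.append(np.hstack([np.zeros(4), Xh, -y * Xh]))  # (P2 - y P3) . X = 0
    A = np.vstack(rows)                        # 2n x 12 design matrix
    _, _, Vt = np.linalg.svd(A)                # solve A p = 0 by SVD
    P = Vt[-1].reshape(3, 4)                   # right singular vector with smallest singular value
    return P / P[2, 3]                         # remove the arbitrary scale
```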
Orientation of a stereo camera
Using a stereo camera system we can estimate the world coordinates of an object in every acquisition. If the baseline B is known, the distance of each point visible in both images can be estimated by triangulation. The two camera centres are placed at the origin (0, 0, 0) and at (B, 0, 0):

x_1 = P_1 X, \quad x_2 = P_2 X \quad \text{for } (P_1, P_2), \qquad P_{(3\times4)} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} = \begin{bmatrix} P^{1T} \\ P^{2T} \\ P^{3T} \end{bmatrix}

where P^{iT} is the i-th row of P, e.g. P^{1T} = [p_{11}\; p_{12}\; p_{13}\; p_{14}]. The two image measurements give a linear system in the unknown world point (A is known, the world point is the unknown):

A X = \begin{bmatrix} u_1 p_1^{3T} - p_1^{1T} \\ v_1 p_1^{3T} - p_1^{2T} \\ u_2 p_2^{3T} - p_2^{1T} \\ v_2 p_2^{3T} - p_2^{2T} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = 0, \quad \text{solved by SVD.}

This process can be repeated for all the different pixels of the images, generating a real-time map of the scene.

Epipolar geometry
- The centre of each camera, the origins of {1} and {2}, plus the world point P define a plane in space: the epipolar plane.
- The world point P is projected onto the image planes of the two cameras at pixel coordinates p1 and p2 respectively; these points are known as conjugate points.
- Two special points, e1 and e2 (the epipoles), are the projection of one camera centre into the other image.
- All of the epipolar lines in an image pass through the epipole.
(Robotics, Vision and Control, 2013)

Epipolar geometry
- A fundamental and important geometric relationship is the epipolar constraint: given a point in one image, its conjugate is constrained to lie along a line in the other image.
- The image point p1 is a function of the world point P. The camera centre, e1 and p1 define the epipolar plane and hence the epipolar line l2 in image two. By definition, the conjugate point p2 must lie on that line.
(Robotics, Vision and Control, 2013)

Fundamental Matrix F
The epipolar geometry of two views is described by a specific 3x3 matrix F, called the fundamental matrix. F maps (tie) points in image 1 to lines in image 2. With p = (u, v) written in homogeneous coordinates as p = (u, v, 1)^T:
- The epipolar line (in image 2) of point p is Fp.
- F^T q is the epipolar line associated with q.
- The epipolar constraint on corresponding points can be expressed as q^T F p = 0.

F matrix properties
- F is singular, with rank two, and has seven degrees of freedom.
- F e_1 = 0 and F^T e_2 = 0.
- The epipoles are encoded in the null-space of the matrix: the epipole of camera one is the right null-space of F.

Fundamental matrix – uncalibrated SfM case
The fundamental matrix is a function of the camera parameters and of the relative orientation (camera pose) between the images.

Estimating F – stereo-pair orientation
Different algorithms have been proposed. The most used methods are:
- 8-point algorithm – linear solution (see supplementary slides). Enforce the matrix to have rank 2. It can have problems, especially with noisy points.
- 7-point algorithm. Use only seven points and obtain a cubic polynomial in α: F = αF1 + (1 − α)F2. F is defined by taking the real solutions for α.
For more information please refer to: Hartley and Zisserman (2004), Multiple View Geometry in Computer Vision, Cambridge University Press.

Traditional image matching workflow
How do we get the corresponding points between images? For a left/right image pair: feature extraction in each image → similarity measure → regularization → matched points → disparity measure → depth estimation → point cloud.
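The following sketch (not from the slides) illustrates the "feature extraction + similarity measure → matched points" part of this workflow, assuming OpenCV with SIFT available; function and variable names are my own. Lowe's ratio test acts as the regularization step.

```python
import cv2
import numpy as np

def match_features(img_left, img_right, ratio=0.75):
    """Feature extraction + descriptor similarity -> matched points (Lowe's ratio test)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_left, None)    # feature extraction (left)
    kp2, des2 = sift.detectAndCompute(img_right, None)   # feature extraction (right)
    matcher = cv2.BFMatcher(cv2.NORM_L2)                  # Euclidean distance on descriptors
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]  # keep unambiguous matches
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2
```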
Feature matching
Matching is mostly done only at salient points (corners, blobs, etc.). Technique: matching is performed by comparing the keypoint descriptors of sparse points extracted from the images.

Feature matching – corners
How to identify a corner feature in an image?
- Corners are identified based on image gradients (the directional change of intensity in an image).
- However, one gradient alone can only be used to detect an edge (since the directional change is in one direction only).
- For corners, two gradients with perpendicular directions are needed.

Feature matching – blobs
Blobs are regions with positive or negative brightness or colour value compared to their neighbourhood. Basic algorithm:
1. First, images are filtered to simulate different scales (e.g. Gaussian blurring), giving a scale-space representation.
2. The filtered image at one scale is subtracted from the filtered image at the previous scale.
3. Check for local extrema across scales (e.g. in a 3x3x3 neighbourhood): a blob is detected.

Feature matching – descriptors and similarity measures
Once a feature has been identified in an image, it has to be "described" in order to enable matching across different images. In general, a descriptor contains information about the immediate pixel neighbourhood of an identified feature and transforms it into a compact vector. Descriptors can be vectors of (1) floating-point or (2) binary numbers. This information can comprise either image gradients or intensity comparisons. Depending on the detector and the descriptor method used, descriptors can account for a wide array of invariances between images: scale, rotation, illumination and affine invariances.

Feature matching – descriptors and similarity measures: example of SIFT (Lowe, 2004)
1. A 16x16 pixel neighbourhood around an identified feature point (keypoint) is selected.
2. The orientations (simplified to 8 possible directions) of the gradients are computed for a 4x4 array over the 16x16 image region.
3. They are stored in a 4x4 keypoint descriptor with 8 possible, weighted orientations, giving a 128-dimensional descriptor vector (e.g. floating-point values such as 1.23, 2.58, 0.23, 3.41, …; binary descriptors instead store bit strings such as 1 1 0 1 1 0 0 0 1 …).
4. Similarity is assessed using the Euclidean distance.

Triangulation from two views
Find X from two views:
- We have P1 and P2.
- We also have the 2D point correspondences [u, v]_1 and [u, v]_2.
- We can find X, as shown below.
- With multi-view triangulation, these first two views serve as the initial solution, as will be discussed later.
(Camera O1 with R1, T1 and projection matrix P1 at t = 1; camera O2 with R2, T2 and projection matrix P2 at t = 2.)

Triangulation from two views

x_1 = P_1 X, \quad x_2 = P_2 X \quad \text{for two cameras } (P_1, P_2), \qquad x_1 \times (P_1 X) = 0

P_{(3\times4)} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} = \begin{bmatrix} P^{1T} \\ P^{2T} \\ P^{3T} \end{bmatrix}, \qquad x_1 = \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix}, \; x_2 = \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix}

where P^{iT} is the i-th row of P, e.g. P^{1T} = [p_{11}\; p_{12}\; p_{13}\; p_{14}]. After some manipulations (general case):

u (p^{3T} X) - (p^{1T} X) = 0 \quad (i), \qquad v (p^{3T} X) - (p^{2T} X) = 0 \quad (ii), \qquad u (p^{2T} X) - v (p^{1T} X) = 0 \quad (iii)

For the two cameras (P1, P2):

A X = \begin{bmatrix} u_1 p_1^{3T} - p_1^{1T} \\ v_1 p_1^{3T} - p_1^{2T} \\ u_2 p_2^{3T} - p_2^{1T} \\ v_2 p_2^{3T} - p_2^{2T} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = 0, \quad \text{solved by SVD (A known, the 3D point unknown).}

Multiple View Geometry in Computer Vision, Richard Hartley and Andrew Zisserman, 2nd ed., Cambridge University Press, 2003.
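A minimal Python sketch of the linear triangulation above (not from the slides), assuming NumPy; `P1`, `P2` are 3x4 projection matrices and `uv1`, `uv2` the matched pixel coordinates.

```python
import numpy as np

def triangulate_point(P1, P2, uv1, uv2):
    """Linear triangulation of one 3D point from two views, solved by SVD."""
    u1, v1 = uv1
    u2, v2 = uv2
    A = np.vstack([
        u1 * P1[2] - P1[0],    # u1 * p1^{3T} - p1^{1T}
        v1 * P1[2] - P1[1],    # v1 * p1^{3T} - p1^{2T}
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null-space of A (homogeneous 3D point)
    return X[:3] / X[3]        # back to Euclidean coordinates
```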
But how to make this automated?
We have seen in the former lesson how feature matching works.
- Feature extractors allow the extraction of many tie-points and their matching using the information provided by descriptors.
- We know that several tie-points are needed to compute the F matrix.
Questions:
- Are all the extracted features correct?
- How can we remove the wrong matches?

RANSAC: Random Sample Consensus
- An algorithm for robust fitting of models in the presence of many data outliers.
- Idea: instead of starting with the full data and trying to work down to a smaller subset of good data, start small and work up to larger sets.
Example on line fitting:
- Given a hypothesized line, count the number of points that "agree" with the line.
- "Agree" means within a small distance from the line, i.e. the inliers to that line.
- Over all candidate lines, select the one with the largest number of inliers.
http://en.wikipedia.org/wiki/RANSAC

RANSAC: Random Sample Consensus – procedure
1. Randomly choose s samples; typically s is the minimum sample size that lets you fit a model.
2. Fit a model (e.g. a line: 2 parameters) to those samples.
3. Count the number of inliers that approximately fit the model.
4. Repeat N times.
5. Choose the model that has the largest set of inliers.

RANSAC + F matrix
Step 1. Extract features.
Step 2. Compute a set of potential matches.
Step 3. Iterate:
  Step 3.1 Select a minimal sample (i.e. 7 or 8 matches).
  Step 3.2 Compute the solution(s) for F using the minimal sample.
  Step 3.3 Determine the number of inliers.
  Stop when a threshold is satisfied or the maximum number of iterations is reached.
Step 4. Compute F based on all inliers.

Structure from Motion – n-image orientation
SfM refers to the process of extracting the three-dimensional structure of the scene as well as the camera motions by analyzing an image sequence.
Requirements: one camera (or more); a set of overlapping images.
Assumptions: the scene is rigid; some of the camera intrinsic parameters are known or constant.

Structure from Motion – n-image orientation
Given m images of n fixed 3D points:

x_{ij} = P_i X_j, \quad i = 1, \dots, m, \quad j = 1, \dots, n

Problem: estimate the m projection matrices P_i (intrinsic and extrinsic) and the n 3D points X_j from the mn correspondences x_{ij} in a sparse structure.

Structure from Motion – practical implementation
There are different strategies for bundle implementation, like the following one presented by Pollefeys M.:
Step 1. Match or track points over the whole image sequence.
Step 2. Initialize the structure and motion recovery.
  Step 2.1 Select two views that are suited for initialization.
  Step 2.2 Relate these views by computing the two-view geometry.
  Step 2.3 Set up the initial frame.
  Step 2.4 Reconstruct the initial structure.
Step 3. For every additional view:
  Step 3.1 Infer matches to the structure and compute the camera pose using a robust algorithm.
  Step 3.2 Refine the existing structure (bundle adjustment).
  Step 3.3 Initialize new structure points.
Step 4. Refine the structure and motion through bundle adjustment.
Marc Pollefeys et al., Visual modeling with a hand-held camera.

Practical example
Step 1. Match or track points over the whole image sequence.
Step 2. Initialize the structure and motion recovery.
Step 3. For every additional view, infer matches to the structure, compute the camera pose and refine the existing structure (refined existing structure; new image with more correspondence points; E1 matrix; P1, P2, X_j; P3).
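As a hedged sketch of the RANSAC + F procedure above, OpenCV's `cv2.findFundamentalMat` with the RANSAC flag performs the minimal-sample/inlier-counting loop internally; `pts1`/`pts2` are the matched points from the earlier matching sketch, and the threshold values are illustrative.

```python
import cv2
import numpy as np

def robust_fundamental(pts1, pts2):
    """RANSAC estimation of F: minimal samples, inlier counting, final fit on the inliers."""
    # args: points, method, reprojection threshold (px), confidence
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inliers = mask.ravel().astype(bool)          # wrong matches are flagged as outliers
    return F, pts1[inliers], pts2[inliers]
```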
Practical example
Step 4. Refine the SfM through bundle adjustment.
Initial model (structure) = X_j, j = 1, 2, …, n features; initial motion = P_i, i = 1, 2, …, m images (P_1 … P_m over time). Bundle adjustment returns the refined \hat{P}^i and \hat{X}_j by minimizing

\min_{\hat{P}^i, \hat{X}_j} \sum_{ij} d(\hat{P}^i \hat{X}_j, x^i_j)^2

Bundle adjustment – problem statement
1) Input:
- 3D points X_j, with j = 1 … n (n points),
- viewed by a set of cameras P^i, with i = 1 … m (m views).
2) Problem: given the projections x^i_j of the n points on the m cameras (images), find
  1) the set of camera matrices P^i
  2) and the set of points X_j
such that P^i X_j = x^i_j. As the image measurements are noisy, the projection equation is not satisfied exactly: P^i X_j ≠ x^i_j.

Bundle adjustment – problem statement
3) Goal: estimate projection matrices \hat{P}^i and 3D points \hat{X}_j which project exactly to image points \hat{x}^i_j, minimizing the distance between the estimated \hat{x}^i_j and the measured x^i_j points for every view:

\min_{\hat{P}^i, \hat{X}_j} \sum_{ij} d(\hat{P}^i \hat{X}_j, x^i_j)^2

where d(x, \hat{x}) is the geometric image distance between x and \hat{x}. The procedure adjusts the bundle of rays between each camera centre and the set of 3D points, and between each 3D point and the set of camera centres. Least-squares optimization tries (iteratively) to find the parameters that minimize the difference between the measurement x and the estimate \hat{x}. The minimization is achieved using nonlinear least-squares algorithms such as Gauss-Newton and Levenberg-Marquardt iterations.

Number of parameters to optimize
Each camera matrix P^i has 11 DoF and there are m views; each point X_j has 3 DoF and there are n points. There are 3n + 11m parameters to optimize (this is a lower bound; it can be more due to over-parameterization). A normal-equation matrix N of size (3n + 11m) x (3n + 11m) has to be factored or inverted, which is very costly.

Visual Odometry – definition
Visual Odometry is the process of incrementally estimating the pose of the vehicle by examining the changes that motion induces on the images of its onboard cameras. As for Structure from Motion, several prerequisites are necessary:
- sufficient illumination of the surrounding environment,
- dominance of the static scene over moving objects,
- textured surfaces, to allow apparent motion to be extracted,
- sufficient overlap between consecutive frames.
Visual Odometry is a particular case of Structure from Motion:
- Visual Odometry focuses on estimating the 3D motion of the camera sequentially (as each new frame arrives) and in real time.
- Bundle adjustment can be used (but it is optional) to refine the local estimate of the trajectory.

Visual odometry – general workflow
Visual Odometry computes the camera path incrementally (pose after pose). It can work with monocular or stereo-camera systems; our focus is on monocular systems. Workflow: image sequence → feature detection (e.g. SIFT feature tracks) → feature matching (tracking) → motion estimation (2D-2D, 3D-3D or 3D-2D, giving T_{k+1,k} between poses C_k and C_{k+1}) → local optimization.

Visual Odometry vs SLAM
SLAM aims at obtaining a global, consistent estimate of the robot's path. This is done by identifying loop closures: when a loop closure is detected, this information is used to reduce the drift in both the map and the camera path (global bundle adjustment). On the other hand, Visual Odometry aims at recovering only the path incrementally, frame by frame, potentially optimizing only over the last m poses (windowed or incremental bundle adjustment). (Image courtesy of Clemente et al., RSS'07.)
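The sketch below (not from the slides) shows the reprojection-error residual that bundle adjustment minimizes, using SciPy's nonlinear least-squares solver; the camera parameterization (axis-angle + translation + focal length) and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(cam, X):
    """cam = [rx, ry, rz, tx, ty, tz, f]: axis-angle rotation, translation, focal length."""
    R = Rotation.from_rotvec(cam[:3]).as_matrix()
    Xc = X @ R.T + cam[3:6]                         # world -> camera coordinates
    return cam[6] * Xc[:, :2] / Xc[:, 2:3]          # perspective division (principal point at 0)

def ba_residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed):
    cams = params[:n_cams * 7].reshape(n_cams, 7)
    pts = params[n_cams * 7:].reshape(n_pts, 3)
    pred = np.vstack([project(cams[c], pts[j:j + 1]) for c, j in zip(cam_idx, pt_idx)])
    return (pred - observed).ravel()                # stacked reprojection errors d(P_i X_j, x_ij)

# x0 stacks initial camera parameters and 3D points (e.g. from incremental SfM);
# least_squares then refines structure and motion jointly:
# result = least_squares(ba_residuals, x0, args=(n_cams, n_pts, cam_idx, pt_idx, observed))
```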
Before loop closing vs. after loop closing: the loop closure allows the accumulated errors to be compensated.

Visual Odometry vs SLAM
- VO only aims at the local consistency of the trajectory.
- SLAM aims at the global consistency of the trajectory and of the map.
- VO can be used as a building block of SLAM.
- VO is SLAM before closing the loop!
(Image courtesy of Clemente et al., RSS'07.)

Visual odometry
- The choice between VO and Visual SLAM depends on the trade-off between performance, consistency and simplicity of implementation.
- VO trades off consistency for real-time performance, without the need to keep track of the entire previous history of the camera.

Visual Odometry – problem formulation
A robot is moving through an environment and taking images with a rigidly attached camera system at discrete times k. In the case of a monocular system, the set of images taken at times k is denoted by I_{0:n} = {I_0, …, I_n}. Two camera positions at adjacent times are related by a rigid transformation T_k; T_{0:n} = {T_0, …, T_n} contains all the subsequent motions, and C_{0:n} = {C_0, …, C_n} contains all the transformations of the camera with respect to the initial frame C_0. The current pose C_n can be computed by concatenating all the transformations, with C_0 (the camera pose at instant k = 0) set arbitrarily by the user.

Visual Odometry – problem formulation
The main task in Visual Odometry is to compute the relative transformations T_k from the images I_k and I_{k-1}, and then to concatenate the transformations to recover the full trajectory C_{0:n} of the camera. This means that VO recovers the path incrementally, pose after pose. An iterative refinement over the last m poses can be performed after this step to obtain a more accurate estimate of the local trajectory (C_0, C_1, …, C_{n-1}, C_n).

Visual Odometry – workflow
Image sequence → camera modelling and calibration → feature detection → feature matching → robust estimation (i.e. RANSAC), as we saw before → motion estimation (2D-2D, 3D-3D or 3D-2D, with error propagation) → local optimization (camera-pose optimization, bundle adjustment).

Visual Odometry – motion estimation
Motion estimation is the core computation step performed for every image in a Visual Odometry system. It computes the camera motion T_k between the previous and the current image. The full trajectory of the camera can then be recovered by concatenating all of these single movements (a minimal sketch follows). The way the motion is computed is the same as seen in SfM.

Visual Odometry – motion estimation
Depending on whether the feature correspondences are specified in 2D or 3D, there are three different cases:
- 2D to 2D: the features are in image coordinates (as in SfM).
- 3D to 3D: the image features are triangulated in object space and the motion is estimated using these points (not very common).
- 3D to 2D: 3D features are re-projected onto the image (as seen in SfM).
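A minimal Python sketch of the pose-concatenation step just described (not from the slides), assuming 4x4 homogeneous transformation matrices and the convention C_k = C_{k-1} T_k; names are illustrative.

```python
import numpy as np

def concatenate_poses(relative_transforms):
    """Recover C_0..C_n by chaining the relative transformations T_k (4x4 homogeneous matrices)."""
    C = np.eye(4)                    # C_0: set arbitrarily (here the identity)
    trajectory = [C.copy()]
    for T_k in relative_transforms:  # T_k relates poses at times k-1 and k
        C = C @ T_k                  # C_k = C_{k-1} * T_k
        trajectory.append(C.copy())
    return trajectory
```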
Motion estimation – 2D-to-2D
- Both sets of features, f_{k-1} and f_k, are specified in 2D.
- The estimation of the relative position of the images is needed; the F or the E matrix can then be computed.
- Having more than the minimum number of points, the solution is found by determining the transformation that minimizes the reprojection error of the triangulated points in each image.

Motion estimation – 3D-to-2D
- f_{k-1} is specified in 3D and f_k in 2D.
- This is the DLT problem seen before.
- The solution is found by determining the transformation that minimizes the reprojection error.
- In the monocular case, the 3D structure needs to be triangulated from two adjacent camera views (e.g. I_{k-2} and I_{k-1}) and then matched to 2D image features in a third view (e.g. I_k).

Motion estimation – 3D-to-3D
- Both f_{k-1} and f_k are specified in 3D.
- To do this, it is necessary to triangulate 3D points (e.g. using a stereo camera); this case applies to stereo cameras only.
- The minimal-case solution involves 3 non-collinear correspondences.
- The solution is found by determining the aligning transformation that minimizes the 3D-3D distance.

Motion estimation – key-frame selection
- Several hundred frames can be acquired in a few seconds of video. Most of the frames give similar information and may not be useful for VO purposes.
- As already seen, 3D points are determined by intersecting (see triangulation) the rays from 2D image correspondences of at least two image frames.
- However, these rays never intersect exactly due to: image noise, camera model and calibration errors, feature matching uncertainty, and the geometric configuration of the two images.

Motion estimation – key-frame selection
- When frames are taken at nearby positions compared to the scene distance, 3D points will exhibit large uncertainty.
- As a consequence, 3D-3D motion estimation methods will drift much more quickly than 3D-2D and 2D-2D methods.

Motion estimation – key-frame selection
- One way to avoid this problem consists of skipping frames until the average uncertainty of the 3D points decreases below a certain threshold. The selected frames are called key-frames.
- Key-frame selection is an important step in VO. It is done before updating the motion, to reduce the error propagation. Key-frames should still be similar enough that common features can be found!

Motion estimation – considerations
- Stereo vision has the advantage over monocular vision that both motion and structure are computed in the absolute scale. It also exhibits less drift (at least in indoor environments).
- When the distance to the scene is much larger than the stereo baseline, stereo VO degenerates into monocular VO.
- Key-frames should be selected carefully to reduce the drift.
- Regardless of the chosen motion computation method, local bundle adjustment (over the last m frames) should always be performed to compute a more accurate estimate of the trajectory.
- After bundle adjustment, the effects of the motion estimation method are much more alleviated (as long as the initialization is close to the solution).

Visual Odometry – drift error
As we have seen in SfM, VO suffers from the same problem: the errors introduced by each new frame-to-frame motion accumulate over time. This is often not perceived locally, but after a long sequence it becomes much more relevant, and it generates a drift of the estimated trajectory from the real path.
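A hedged sketch of the calibrated 2D-to-2D case using OpenCV's essential-matrix routines (not from the slides); `K` is the known intrinsic matrix, and in the monocular case the recovered translation is defined only up to scale.

```python
import cv2
import numpy as np

def motion_2d_2d(pts_prev, pts_curr, K):
    """2D-to-2D motion estimation: essential matrix with RANSAC, then relative pose (R, t)."""
    E, mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=mask)
    T = np.eye(4)                    # assemble the homogeneous transformation T_k
    T[:3, :3] = R
    T[:3, 3] = t.ravel()             # translation known only up to scale (monocular)
    return T
```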
Motion estimation – error propagation
The uncertainty of the camera pose at C_k is a combination of the uncertainty at C_{k-1} (black solid ellipse) and the uncertainty of the transformation T_{k,k-1} (grey dashed ellipse):

C_k = f(C_{k-1}, T_k)

and the combined covariance Σ_k follows from propagating both contributions. The camera-pose uncertainty always increases when concatenating transformations; thus it is important to keep the uncertainties of the individual transformations small.

Windowed Bundle Adjustment
Over the poses C_{n-m}, C_{n-m+1}, …, C_{n-1}, C_n linked by the transformations T_1, T_2, …, T_n (plus longer-range constraints such as T_{3,1}, T_{4,1}, T_{n-1,3}) within a window of size m:
- Windowed BA optimizes the 3D points and the camera poses.
- For efficiency, only the last m keyframes are used.
- The initialization should be close to the minimum, in order not to get stuck in local minima.

Windowed Bundle Adjustment
- The choice of the window size m is governed by computational reasons.
- The computational complexity of BA is O((qN + lm)^3), with N the number of points, m the number of poses, and q and l the number of parameters for points and camera poses.
- Other sensors can be used, such as an IMU (giving inertial VO, integrated using an Extended Kalman Filter), a compass (or similar), or GNSS.

SLAM – loop detection
- Loop constraints are very valuable constraints for pose-graph optimization.
- These constraints form graph edges between nodes that are usually far apart and between which large drift might have been accumulated.
- Events like re-observing a landmark after not seeing it for a long time, or coming back to a previously mapped area, are called loop detections.
- Loop constraints can be found by evaluating the visual similarity between the current camera images and past camera images.
- Visual similarity can be computed using global or local image descriptors.
- A bag of words "summarizes" the information stored by all the descriptors in an image into a database that allows keyframes to be compared quickly and similarities to be detected. The trade-off is between "summarizing" and getting good recall (first observation vs. second observation after a loop).
(Image courtesy of Cummins & Newman, IJRR'08.)

Loop detection
Loop closure is run for new keyframes and consists of (ORB-SLAM3):
1) Query of the bag-of-words database to search for other keyframes (K_m) that look similar to, but are not visible from, the current keyframe (K_a).
2) Definition of a local window that includes K_m and adjacent keyframes (and the corresponding 3D points). Feature matching is run between K_a and the other keyframes; 3D-3D matching is also performed on the map points.
3) Computation of the 3D transformation T_am that best aligns the map points in the K_m local window with those of K_a. A minimal set of three 3D-3D matches is used to iteratively (RANSAC) generate each hypothesis for T_am; the solution with the maximum number of inliers is taken.
4) Guided matching refinement. The number of correspondences is maximized and refined, and a new T_am is computed with a non-linear optimization.
5) Further verification with other keyframes.
6) Pose-graph optimization to propagate the loop correction to the rest of the map. The final step is a global BA to further optimize the loop closure with mid-term and long-term matches.
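A toy illustration of the bag-of-words similarity used for loop detection (not from the slides): descriptors are quantized against a visual vocabulary (here simply a KxD array of word centroids, a stand-in for the vocabulary trees used by systems like ORB-SLAM3) and keyframes are compared via their normalized word histograms.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors to visual words and build a normalized histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)                      # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def loop_candidates(query_hist, keyframe_hists, threshold=0.8):
    """Return indices of past keyframes whose bag-of-words similarity exceeds a threshold."""
    scores = np.array([float(query_hist @ h) for h in keyframe_hists])  # cosine similarity
    return np.nonzero(scores > threshold)[0], scores
```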
Monocular SLAM
SLAM performed using a single camera: most drones have a single camera, not a stereo rig. RGB cameras are used in most cases, with a few examples using thermal cameras. There are two main typologies of algorithms:
1) Feature-based methods: use of distinctive features.
2) Direct methods: use of entire images and photometric consistency.

Monocular SLAM – feature-based methods
These track features using different approaches. ORB-SLAM3 is a recent and widely used solution:
- tracks FAST features with ORB descriptors;
- can be visual or visual-inertial (the combination is used to bridge gaps);
- place recognition, map merging and loop closure using bag of words;
- a nice combination of short-, medium- and long-term data association to reduce drift.
https://www.youtube.com/watch?v=ufvPS5wJAx0
ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, Campos et al., 2020.

Monocular SLAM – photometric loss definition
Given a point X = (X, Y, Z)^T in 3D space, its projection in the reference image is

x_{ref} = \frac{1}{Z} K X

and the 3D point can be expressed as

X = Z K^{-1} x_{ref}

The projection of the point in the target image can then be expressed as

x_{tgt} = K [R \mid T] \, Z K^{-1} x_{ref}

By having the depth information (i.e. Z) for all the pixels, we can form a warped image in a new position (i.e. the target image) by moving (warping) the pixels of the reference image into the target position.

Monocular SLAM – photometric loss definition
Using the warping concept, we can define the photometric loss as the difference between the warped image and the target image, computed over a frame of size [H, W]. If the depth and the poses of both images are correct, the loss will be small; vice versa, if they are incorrect the loss will be large. By optimizing the photometric loss we can therefore optimize the pose. This was traditionally achieved using non-linear least-squares optimizations such as the Levenberg-Marquardt or Gauss-Newton algorithms.

Monocular SLAM – dense direct methods
Dense Tracking And Mapping (DTAM) computes pose and depths by minimizing photometric losses (projective photometric cost volume, pixel-level photometric error):
1) Construct the cost volume considering the photometric loss.
2) Minimize the cost volume, taking the minimum-cost depths.
The more images, the better: DTAM first optimizes the depths, then uses these depths to optimize the poses, and proceeds iteratively.
R. A. Newcombe, S. J. Lovegrove and A. J. Davison, DTAM: Dense tracking and mapping in real-time. International Conference on Computer Vision, 2011, pp. 2320-2327, 10.1109/ICCV.2011.6126513.
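A minimal NumPy sketch of the photometric error used by direct methods (not from the slides): each reference pixel is back-projected with its depth, reprojected into the target view with the pose (R, T), and the sampled target intensity is compared with the reference intensity. Nearest-neighbour sampling is used for simplicity; all names are illustrative.

```python
import numpy as np

def photometric_loss(ref_img, tgt_img, depth_ref, K, R, T):
    """Mean absolute photometric error between reference pixels and their warped target samples."""
    H, W = ref_img.shape
    K_inv = np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x (H*W), homogeneous
    X = (K_inv @ pix) * depth_ref.reshape(1, -1)     # back-project: X = Z * K^-1 * x_ref
    x_tgt = K @ (R @ X + T.reshape(3, 1))            # project into the target view
    x_tgt = x_tgt[:2] / x_tgt[2:3]                   # perspective division
    ut = np.round(x_tgt[0]).astype(int)
    vt = np.round(x_tgt[1]).astype(int)
    valid = (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)
    warped = tgt_img[vt[valid], ut[valid]]           # target intensities at the warped locations
    residual = ref_img.reshape(-1)[valid] - warped
    return np.mean(np.abs(residual))
```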
Monocular SLAM – semi-dense direct methods
Direct methods do not extract features, but directly use the pixel intensities in the images, and estimate motion and structure by minimizing a photometric error. The depth maps are not created for all the pixels, but only for those in the neighbourhood of large image-intensity gradients, making them semi-dense. Three main steps:
- tracking,
- depth map estimation,
- map optimization.
LSD-SLAM: https://vision.in.tum.de/_media/spezial/bib/engel14eccv.pdf
Limitations: larger accumulation of the error (3D registration of new images); pure rotations give problems.

Monocular SLAM – with DL approaches
Deep-learning-assisted visual SLAM can support traditional SLAM algorithms in different steps of the process:
- feature extraction and matching,
- loop closure,
- (optimization).
Other approaches use DL for the complete SLAM pipeline. Three different typologies of SLAM using DL can be catalogued:
- Supervised: ground truth is used to train the network; the CNN regresses the camera pose and the depth.
- Self-supervised: the photometric loss is used to train without the need for ground truth. Image warping from one frame to another is used to determine the pose and minimize the loss by back-propagation.
- Hybrid supervision: several supervision labels, such as the real pose and the depth map, are used during training to support the process.

Stereo-pair pose estimation with DL: PoseNet
Goal: camera pose regression with a CNN and a supervised approach. Compared to previous methods, it simultaneously learns position and rotation. The camera pose is given by translations (x) and rotations (q, in quaternion form). The supervised training tries to minimize the following loss:

Loss(I) = \lVert \hat{x} - x \rVert_2 + \beta \lVert \hat{q} - q \rVert_2

where β accounts for the different weight of x and q (it changes from indoor to outdoor datasets).
Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, in ICCV 2015.

Stereo-pair pose estimation with DL: PoseNet
In the original implementation a modified version of GoogLeNet was used. Compared to the original GoogLeNet:
- the softmax classifiers are removed;
- the fully connected layers are modified to output the orientation parameters.

Stereo-pair pose estimation with DL: PoseNet
Higher performance in challenging situations (blurred images, hard weather conditions): it is able to handle contours and homogeneous areas. Accuracies are still relatively low; on GPU, it is faster than traditional algorithms.

Stereo-pair pose estimation with DL: PoseNet
The system is robust to wider baselines between images, and it relies on a different distribution of salient areas in the image compared to traditional approaches. Transfer learning has problems similar to other tasks (e.g. classification): the training data still need to be "close" to the testing datasets. Different implementations have been inspired by the initial algorithm; the use of direct methods (see the next presentation) has further improved the results.

Conventional methods vs DL ones
Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, in ICRA 2017.
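As a hedged illustration of the PoseNet loss above (not from the slides), a plain NumPy version computing the weighted sum of position and orientation errors; the default β is only a placeholder, since the paper tunes it per dataset (larger outdoors than indoors).

```python
import numpy as np

def posenet_loss(x_pred, x_true, q_pred, q_true, beta=500.0):
    """PoseNet-style pose regression loss: position error + beta-weighted quaternion error."""
    q_pred = q_pred / np.linalg.norm(q_pred)      # normalize the predicted quaternion
    pos_err = np.linalg.norm(x_pred - x_true)     # ||x_hat - x||_2
    rot_err = np.linalg.norm(q_pred - q_true)     # ||q_hat - q||_2
    return pos_err + beta * rot_err
```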
DeepVO
A supervised method. Initial convolutions on two stacked images derive the features, and Recurrent Neural Networks keep the connection among consecutive images: each new image makes use of the information coming from the previous steps through an LSTM (Long Short-Term Memory) setting.
Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, in ICRA 2017.

DeepVO
The loss function minimises the residuals of positions (p) and orientations (φ):

\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \left( \lVert \hat{p}_k - p_k \rVert_2^2 + \kappa \lVert \hat{\varphi}_k - \varphi_k \rVert_2^2 \right)

where κ balances the weights of positions and orientations. DeepVO often has overfitting problems, and the results are strongly influenced by the length of the training sequence. It can often outperform traditional methods if the training set is sufficiently large, and the scale can be learnt during the end-to-end training.

SfMLearner
A self-supervised method that combines a camera pose estimation network and a single-image depth estimation (SIDE) network to retrieve the structure and the motion of the camera. Despite being jointly trained, the depth model and the pose estimation model can be used independently at test-time inference. It embeds projective geometry into the learning process:

p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} p_t

where p_s is a point in the source image, p_t the corresponding point in the target image, K the camera matrix, \hat{D}_t the predicted depth and \hat{T}_{t \to s} the predicted relative pose.
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017.

SfMLearner
As in SfM, we assume (1) a static scene, (2) no occlusions and (3) Lambertian surfaces. A soft mask \hat{E}_s is introduced to account for possibly wrong correspondences. The view-synthesis loss is

L_{vs} = \sum_{s} \sum_{p} \hat{E}_s(p) \, \lvert I_t(p) - \hat{I}_s(p) \rvert

and the full loss combines it with a smooth-gradient term and a cross-entropy regularization of the mask:

L_{final} = \sum_{l} L^{l}_{vs} + \lambda_s L^{l}_{smooth} + \lambda_e \sum_{s} L_{reg}(\hat{E}^{l}_s)

DVSO
A hybrid method combining deep-learning depth prediction and traditional Direct Sparse Odometry (DSO). The training combines depth predictions given by traditional (i.e. DSO) and DL methods. Three elements: self-supervised learning from photo-consistency, supervised learning based on traditional direct methods, and StackNet for monocular depth estimation. A good depth estimate is obtained first and is then used for the optimization of the camera pose, like a (virtual) stereo approach.
N. Yang, R. Wang, J. Stückler, D. Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry. In ECCV 2018. https://cvg.cit.tum.de/research/vslam/dvso

Monocular SLAM – densification
Sparse reconstructions can be densified with dense stereo reconstruction. There is a growing trend in the use of deep learning to estimate depths (with the open question of scale). ORB-SLAM combined with single-image depth estimation can be used to densify the 3D reconstruction; there are still many errors on the borders of each frame, plus transferability limitations.
CNN based dense monocular visual SLAM for indoor mapping and autonomous exploration, Anne Steenbeek's MSc thesis, 2019.

Loop closure with DL
Researchers have proposed to use ConvNet features from neural models pre-trained on large-scale generic image-processing datasets; image similarity is calculated with the norm of the feature vectors to determine whether a loop exists. Another approach is to use a deep auto-encoder structure to extract a compact representation that compresses the scene in an unsupervised manner. Other specific networks have been designed to quickly detect whether a region has been visited or not (novelty networks), combining them with a "dictionary" that stores the information of previous frames.
https://doi.org/10.1016/j.robot.2020.103470
https://arxiv.org/abs/1805.07703
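A hedged sketch of ConvNet-feature loop detection (not from the slides), assuming PyTorch and torchvision are available; ResNet-18 is used here only as a stand-in for a generic pre-trained backbone, and cosine similarity is one simple choice of similarity measure.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ConvNet used as a global image descriptor (the classification head is dropped).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def convnet_descriptor(image):
    """Compute a normalized global feature vector for an (H, W, 3) uint8 image."""
    feat = backbone(preprocess(image).unsqueeze(0)).squeeze(0)
    return feat / feat.norm()

def is_loop(desc_query, desc_past, threshold=0.9):
    """Declare a loop candidate when the feature similarity exceeds a threshold."""
    return float(desc_query @ desc_past) > threshold
```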
Windowed Bundle Adjustment with DL
LS-Net tackles this problem via a learning-based optimizer, integrating analytical solvers into its learning process. It learns a data-driven prior that is then improved by refining the neural-network predictions with an analytical optimizer to ensure photometric consistency. It can optimize sum-of-squares objective functions in SLAM algorithms, which are often difficult to optimize due to violated assumptions and ill-posed problems. The networks (including those of the optimizer) are trained using a supervised loss.

Windowed Bundle Adjustment with DL
BA-Net integrates Levenberg-Marquardt into a deep neural network for end-to-end learning. Instead of minimizing a geometric or photometric error, BA is performed in feature space, optimizing the consistency loss of features extracted from multi-view images by ConvNets. The feature-level optimizer can mitigate problems of the geometric or photometric solutions (e.g. some information is lost in the geometric optimization, while environmental dynamics and lighting changes may affect the photometric optimization). Inputs: pose, depth and pixel features; the feature-metric difference is minimized.
https://arxiv.org/pdf/1806.04807

Deep active localization
Active localization consists of generating robot actions to maximally disambiguate the robot's pose within a reference map. The system is composed of two learned modules: a convolutional neural network for perception, and a deep-reinforcement-learned planning module. Applying deep learning to the different components allows better results to be obtained.

Relevant references
- Overview of deep learning application on visual SLAM, Li et al., 2022, Displays, vol. 74, September 2022, 102298. https://www.sciencedirect.com/science/article/pii/S0141938222001160
- Deep Learning for Visual Localization and Mapping: A Survey. IEEE Transactions on Neural Networks and Learning Systems. https://arxiv.org/abs/2308.14039
- S. Mokssit, D. B. Licea, B. Guermah and M. Ghogho, "Deep Learning Techniques for Visual SLAM: A Survey," IEEE Access, vol. 11, pp. 20026-20050, 2023, doi: 10.1109/ACCESS.2023.3249661

AI FOR AUTONOMOUS ROBOTS — SfM, SLAM AND VO — Dr. Francesco Nex
