
AIrobotsexam.pdf


Transcript


AI for autonomous robots exam

Machine Learning paradigms

Learning objectives:
▪ Define and distinguish different machine learning paradigms
▪ Discuss selected learning algorithms
▪ Recognize how paradigms are used in certain applications

AI: enabling a computer to perform specific actions using explicit rules
Machine learning: implicitly learning the above rules from observational data

Importance of paradigms:
▪ Common language
▪ Problem-specific solutions
▪ Understanding diversity

Relevant ML paradigms

Supervised Learning: learning from examples
▪ Examples: email classifier, image classification, speech recognition, language translation, convolutional network for scene understanding
▪ Learning from already labelled data; function approximation
▪ Deduction: the equation is known and the model is tested based on it
▪ Induction: building the model from observations, e.g. measuring events happening at a specific time (x is at point y at time z)
▪ Advantages: much labelled data available (public datasets); very high accuracy
▪ Classification: deciding which category an object belongs to (e.g. deer)
▪ Regression: estimating a property of an object (e.g. hair density)
▪ Key concepts: sensor data, labelled training data, supervised training, validation and test data
▪ Training, validation and test data:
  1. Train the model with the training data
  2. Validate the model during training with the validation data (helps in tuning hyperparameters and making decisions about the model architecture, e.g. number of layers, learning rate)
  3. Test the model with the test data
▪ Evaluation metrics: overall accuracy, precision-recall curve, intersection over union (IoU; the higher the better), root mean square error (RMSE)

Unsupervised Learning: learning similarities between examples
▪ No labelled examples, no explicit supervision
▪ Learning patterns and (underlying) structures
▪ Hidden insights, similarities and groupings within the data (e.g. anomaly detection algorithms)
▪ Goal: learn a hidden representation of the data
▪ Examples: clustering (grouping of similar items), dimensionality reduction (reduce the number of features while preserving the important ones), anomaly detection (identifying abnormal data is like finding a needle in a haystack)
▪ Applications: unsupervised learning to initialise parameters (pretraining), deep neural networks for clustering and learning
▪ Evaluation metrics: Silhouette Score (measures cluster quality; the higher the better), Davies-Bouldin Index (the lower the better), reconstruction error, anomaly score

Semi-supervised Learning: learning from limited examples
▪ Labelled and unlabelled data, with the unlabelled part being the larger amount ("better than nothing")
▪ Challenges in labelled data acquisition:
  o Cost and time requirements
  o Scarcity of labels
  o Labelling bias (e.g. brides can look very different in different cultures)
  o Dynamic nature of the data (labels might not stay accurate over time)
▪ Learning techniques:
  o Pseudo-labelling: the model is first trained on the labelled examples and then predicts labels for the unlabelled examples; afterwards it treats these new labels as true and trains further (see the sketch after this list)
  o Consistency regularization: the model is encouraged to treat an input and its modified version similarly
  o Self-training (the model uses its own predictions as pseudo-labels) vs. co-training (the labelling technique is applied across different models; the models create labels for each other)
▪ Applications: popular in label-scarce applications (network intrusion detection), medical image segmentation
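A minimal sketch of the pseudo-labelling loop described above, assuming scikit-learn is available; the classifier choice, the placeholder arrays and the 0.9 confidence threshold are illustrative assumptions, not values from the lecture.

```python
# Pseudo-labelling sketch (assumes scikit-learn; the arrays are placeholders
# standing in for a small labelled set and a larger unlabelled set).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(100, 8))
y_labelled = (X_labelled[:, 0] > 0).astype(int)
X_unlabelled = rng.normal(size=(1000, 8))

# 1. Train on the small labelled set.
model = LogisticRegression().fit(X_labelled, y_labelled)

# 2. Predict the unlabelled data and keep only confident predictions.
proba = model.predict_proba(X_unlabelled)
confident = proba.max(axis=1) > 0.9            # illustrative threshold
pseudo_labels = proba.argmax(axis=1)[confident]

# 3. Treat the pseudo-labels as true labels and retrain on the union.
X_combined = np.vstack([X_labelled, X_unlabelled[confident]])
y_combined = np.concatenate([y_labelled, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)
```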
Self-supervised Learning: making up your own examples to learn from
▪ Generate supervisory signals from the input data itself by designing pretext tasks
▪ Learning from intrinsic signals:
  o Spatial or temporal relationships
  o Data transformations (e.g. turning an RGB image to grayscale)
▪ Pretext tasks (tasks where the model predicts one part of the data from another):
  o Temporal ordering
  o Context prediction
  o Spatial relations
  o Generative models: generate data given incomplete input; capture meaningful relations
▪ Learning techniques:
  o Generative: generate new data similar to the training data
  o Contrastive: contrasting positive/similar data points against negative/dissimilar data points
  o Adversarial: making the model robust against adversarial attacks
▪ Applications: depth estimation, 3D reconstruction

Reinforcement Learning: learning without examples
▪ Learning by interacting with the environment; active rather than passive, unlike the other learning paradigms
▪ Optimization of a reward signal; feedback is evaluative (it does not represent correctness but goodness)
▪ Goal: maximize future reward
▪ Reward hypothesis: "Any goal can be formalized as the outcome of maximizing a cumulative reward"
▪ DRL algorithms:
  o Value based (learn values; implicit policy, e.g. greedy; see the toy example below)
  o Policy based (no values; learn the policy directly)
  o Actor-critic (learn values; learn a policy)
▪ Examples: autopilot, Boston Dynamics Spot's locomotion control
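To make the value-based idea concrete (learn values, act greedily on them), here is a toy tabular Q-learning sketch on an invented 1-D corridor environment; the environment, rewards and hyperparameters are illustrative assumptions, not from the lecture.

```python
# Toy value-based RL: tabular Q-learning on an invented 1-D corridor.
# The agent gets reward +1 for reaching the right-most cell; the learned
# policy is the greedy policy with respect to the value table Q.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9               # learning rate, discount factor

rng = np.random.default_rng(0)
for episode in range(300):
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))   # random exploration (Q-learning is off-policy)
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # greedy action per state (the last state is terminal)
```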
Semantic Segmentation

Learning objectives:
▪ Define semantic segmentation
▪ Discuss popular network architectures
▪ Recognize real-world applications

Semantic segmentation: dividing an image into multiple segments or regions and assigning each pixel in the image a label corresponding to the semantic class it belongs to

Applications:
▪ Autonomous robots
▪ Medical imaging
▪ Satellite image analysis
▪ Surveillance systems

Traditional image segmentation methods:
▪ Low-level features
▪ Thresholding, edge detection, region growing, clustering
▪ Pixel isolation
▪ Lack of semantic meaning and context

Pixel-level classification:
▪ Labels are assigned to each pixel
▪ Training minimizes the difference between prediction and ground truth
▪ Inference: predict a label for each pixel

Role of CNNs:
▪ Extract hierarchical features
▪ Encoder: extracts high-level features using convolutional and pooling layers (reduces spatial dimensions)
▪ Decoder: reconstructs the spatial dimensions using upsampling techniques
▪ Predicts labels in real time; enables accurate segmentation

Evaluation metrics (three of them are sketched below):
▪ Pixel accuracy
▪ Mean pixel accuracy
▪ Precision, recall, F-score
▪ Mean intersection over union
▪ Frequency-weighted intersection over union
▪ Dice coefficient

Loss functions:
▪ Cross-entropy loss
▪ Dice loss
▪ Weighted loss functions
▪ Combination loss functions
▪ Binary cross-entropy loss
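A small NumPy sketch of three of the metrics listed above (pixel accuracy, mean IoU, Dice coefficient), computed from integer label maps; the toy arrays are invented for illustration.

```python
# Pixel accuracy, mean IoU and Dice coefficient from integer label maps.
import numpy as np

def segmentation_metrics(pred, gt, n_classes):
    """pred, gt: (H, W) integer class maps with values in [0, n_classes)."""
    pixel_acc = (pred == gt).mean()
    ious, dices = [], []
    for c in range(n_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union == 0:                       # class absent in both maps
            continue
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + g.sum()))
    return pixel_acc, np.mean(ious), np.mean(dices)

pred = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
gt   = np.array([[0, 1, 1], [1, 1, 2], [2, 2, 0]])
print(segmentation_metrics(pred, gt, n_classes=3))
```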
Semantic segmentation network architectures

Sliding-window CNN:
▪ Very inefficient: shared features between overlapping patches are not reused

Fully convolutional on full resolution ("cow cow cow" paradigm):
▪ Performing multiple convolutions on the original image resolution
o Very expensive

Hourglass networks:
▪ A stack of CNNs with downsampling and upsampling

Upsampling techniques (used for increasing the resolution of a feature map):
▪ Unpooling
▪ Nearest-neighbour upsampling
▪ Transposed convolution

Semantic segmentation pipeline: RGB image as input, class-based visualisation as output

Fully convolutional networks (FCN):
▪ Deep learning architecture used for dense pixel-wise prediction tasks
▪ Entirely convolutional; encoder-decoder structure
▪ Utilizes skip connections (combining features from different layers) for multi-scale feature fusion (a toy sketch follows after this list)

U-Net:
▪ U-shaped architecture
▪ Contracting path for context, expansive path for localization
▪ Skip connections for detailed spatial information
▪ Tailored loss function for semantic segmentation

DeepLab:
▪ Atrous convolutions
▪ Atrous Spatial Pyramid Pooling

SegNet:
▪ Encoder-decoder architecture
▪ Uses max-pooling indices for upsampling
▪ Skip connections for multi-scale context
▪ Efficient balance between accuracy and computational complexity
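To make the encoder-decoder pattern concrete, here is a toy two-level U-Net-style network with a single skip connection, assuming PyTorch is available; it is a sketch of the general pattern (convolution + pooling down, transposed convolution up, concatenation skip), not a reproduction of the original U-Net or any specific published architecture.

```python
# Toy U-Net-style encoder-decoder for semantic segmentation (PyTorch assumed).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=5):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                       # encoder: halve resolution
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(32, 16, 2, stride=2)   # decoder: upsample
        self.dec  = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)           # per-pixel class scores

    def forward(self, x):
        s1 = self.enc1(x)                  # high-resolution features (skip)
        x  = self.enc2(self.pool(s1))      # low-resolution, high-level features
        x  = self.up(x)                    # back to the skip's resolution
        x  = self.dec(torch.cat([x, s1], dim=1))   # skip connection: concatenation
        return self.head(x)                # (N, n_classes, H, W) logits

logits = TinyUNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)        # torch.Size([1, 5, 64, 64]) -> one class score vector per pixel
```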
Importance of high-quality data:
▪ Accuracy of ground-truth labels
▪ Robustness to variations
▪ Generalization across domains
▪ Mitigating bias and inequality (e.g. brides look different in different cultures)

Labelling techniques:
▪ Manual labelling for accuracy
▪ Semi-automatic labelling for efficiency (using online resources such as LabelMe)
▪ Crowdsourced labelling for scalability (many people are hired to do the labelling)
▪ Active learning for informative data selection (use the updated model to select new uncertain samples, get labels, and retrain)
▪ Transfer learning for knowledge transfer (a model developed for a similar task is reused)
▪ Data augmentation

Applications:
▪ Scene understanding and navigation (lane detection)
▪ Object detection and localization (pixel-level labelling for precise localization; boundary detection; …)

Environmental variations:
▪ Illumination changes
▪ Weather
▪ Viewpoint variability
▪ Occlusion and clutter

Real-time processing:
▪ Computational complexity optimization
▪ Efficient model architectures
▪ Hardware acceleration techniques
▪ Quantization and pruning for model efficiency

Integration:
▪ Compatibility and interoperability
▪ System-level optimization
▪ Sensor fusion and data fusion

Object detection
▪ Detecting objects of certain classes: locating objects using bounding boxes and classifying them into the category of that particular object
▪ Object detection is a fusion of the object localization and classification tasks

Challenges (imaging conditions):
o Illumination
o Object pose
o Clutter
o Occlusion
o Intra-class appearance
o Viewpoint

Challenges (error types):
▪ Bad localization
▪ Confusion with similar objects
▪ Confusion with dissimilar objects
▪ Misc. background

General workflow

Specifying an object model:
1. Statistical template in a bounding box (the object is some (x, y, w, h) in the image)
2. Articulated parts model (the object consists of parts and each part can be detected)
3. Hybrid template/parts model

Feature representations:
▪ Pixel-based representations are sensitive to small shifts
▪ Colour-based appearance is sensitive to illumination and intra-class appearance variations
▪ Possible solution: use edges and intensity-oriented gradients
▪ Radiometric information is summarized in a set of features (a list of numbers) that offers invariance to small shifts, rotations and illumination changes
▪ Similarity can also be measured using a very large set of features given by CNNs

Generating hypotheses:
o Sliding window: test a patch at each location and scale (classify each window separately)
o Voting from patches/keypoints (look at interest points and matched codebook entries, then do probabilistic voting)
o Region-based proposals: generate candidate regions that may contain objects

Scoring hypotheses:
o Sort the hypotheses to see which solution is most feasible
o Mainly gradient-based features, usually based on a summary representation; many possible classifiers

Resolving detections:
o Rescore to select the best hypothesis: rescore each proposed object based on the whole set and select the best one

Intersection over Union (IoU):
o Function used to evaluate the object detection algorithm and resolve ambiguities
o Measure of the overlap between two bounding boxes; the closer to 1 the better
o Numerator: area of overlap between the predicted bounding box and the ground-truth bounding box
o Denominator: area of union, i.e. the area encompassed by both the predicted box and the true box

Non-max suppression (NMS):
o The detector typically creates several boxes with different confidence scores for one object
o Discard bounding boxes with a confidence score below e.g. 0.6
o Discard bounding boxes with a high IoU with the selected bounding box (if the IoU is higher than e.g. 0.5, the probability that they point at the same object is high and the boxes with lower confidence can be removed)
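A hedged NumPy sketch of the two tools just described: IoU between two boxes in (x1, y1, x2, y2) form, and a greedy non-max suppression using the 0.6 confidence and 0.5 IoU thresholds mentioned above; the box coordinates and scores are invented.

```python
# IoU between axis-aligned boxes (x1, y1, x2, y2) and greedy non-max suppression.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # area of overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)               # overlap / union

def nms(boxes, scores, score_thr=0.6, iou_thr=0.5):
    keep = []
    idx = [i for i in np.argsort(-scores) if scores[i] >= score_thr]
    while idx:
        best = idx.pop(0)
        keep.append(best)
        # drop remaining boxes that overlap the selected box too much
        idx = [i for i in idx if iou(boxes[best], boxes[i]) < iou_thr]
    return keep

boxes  = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 150]])
scores = np.array([0.9, 0.75, 0.8])
print(nms(boxes, scores))    # -> [0, 2]: the near-duplicate box 1 is suppressed
```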
Two-stage frameworks:
▪ Divide the detection process into a region proposal stage (defining regions of interest using reference boxes) and a classification stage (classifying the region proposals and improving their localization)
▪ Slower but more accurate; generally cannot be used in real time
▪ Examples: R-CNN, Fast R-CNN, Faster R-CNN, and so on

One-stage detectors:
▪ A single feed-forward CNN that directly outputs bounding boxes and classifications
▪ Easier to deploy on edge devices; can be used in real time
▪ Trained on large datasets and then fine-tuned
▪ Most used in robotics
▪ Examples: SSD, the YOLO series, etc.

R-CNN (region-based convolutional neural network):
1. Input image
2. Extract region proposals
3. Compute CNN features
4. Classify regions (with SVMs)
Problem: the CNN has to be run around 2000 times because each region proposal is forwarded through the network separately (the CNN is computed for every region of interest)

Fast R-CNN:
▪ Region proposals are handled in feature space rather than image space: the features are extracted once for the whole image and the regions of interest are then selected on the feature map

Faster R-CNN:
1. Run once per image: a region proposal network further reduces the number of possible proposals
2. Run once per region: crop features (RoI pooling), predict the class, predict the bounding-box offset
o Improved speed and accuracy
o Uses boxes of predefined shapes (anchors)

YOLO (You Only Look Once):
o 2 objects per cell are considered
o A single CNN classifies and localizes objects using bounding boxes
1. The input image is divided into a 7x7 grid of cells
2. The image is processed by the backbone network to be represented by features
3. Another convolutional layer learns kernel (similarity) parameters that combine the 512 feature maps into an activation corresponding to the grid cell that contains an object; the weights need to be learned across all feature maps. This determines which grid cells probably contain an object, which classes are likely present in each cell, and the bounding box in each grid cell
4. Determine the bounding box for each grid cell
5. The full output of applying the convolutional filters is one bounding-box descriptor per grid cell (a toy decoding sketch follows after this section)
6. Based on the training data, bounding-box priors / anchor boxes (5 boxes) have been added in newer YOLO versions
7. Multiple boxes can be produced for each object; NMS and IoU are used to obtain one prediction per object

YOLOv8:
▪ Improved detection performance and accuracy
▪ Supports various vision AI tasks (allows usage in various applications and domains)
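Below is a toy decoder for a YOLO-v1-style output tensor, using the 7x7 grid and 2 boxes per cell mentioned above; the tensor layout, class count and score threshold are illustrative assumptions, not the actual YOLO implementation.

```python
# Toy decoding of a YOLO-v1-style output tensor (assumed layout, not real YOLO code):
# an S x S grid of cells, B boxes per cell with (x, y, w, h, confidence), plus C
# class probabilities shared by the cell. NMS (sketched earlier) would follow.
import numpy as np

S, B, C = 7, 2, 20
out = np.random.rand(S, S, B * 5 + C)        # stand-in for the network output

detections = []
for row in range(S):
    for col in range(S):
        cell = out[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            score = conf * class_probs.max()        # box confidence * class probability
            if score < 0.25:                        # illustrative threshold
                continue
            # (x, y) are offsets inside the cell; convert to image-relative coordinates
            cx, cy = (col + x) / S, (row + y) / S
            detections.append((cx, cy, w, h, int(class_probs.argmax()), score))

print(len(detections), "candidate boxes before non-max suppression")
```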
Object tracking
▪ Follow an object in a sequence of images: given the position of the object in the previous image, predict where it will be in the next image
▪ The future position can be predicted using Kalman filters or particle filters

Problem statement:
▪ Input: target in an image
▪ Objective: estimate the target state (position) over time
▪ Design choices: object representation, similarity measures, searching process

Challenges:
▪ Variations due to geometric change of the object
▪ Variations due to different illumination
▪ Occlusions
▪ Image quality (resolution, blur)
▪ Similar objects in the scene
▪ Non-linear motion of the object
▪ …

Main steps of the algorithm; main components of a tracking system:
▪ Detection and loss detection: (re-)initialize the system
▪ Estimation: measurement processing and state update
▪ Models: all useful prior information about the object, the sensors and the environment

Object tracking algorithms can be classified according to:
▪ How the motion is estimated:
  o Deterministic: modelled using a cost function
  o Probabilistic: the motion model is considered in a statistical fashion
  o Appearance/on-line: the only information is given by the images (classification)
▪ How the object is represented:
  o Point tracking: key parts of the object are tracked
  o Appearance tracking: a specific image region is tracked (focus on visual properties of the object, e.g. colour, shape)
  o Silhouette tracking: search for the same shape in the sequence

Deterministic methods:
▪ Benefits: quick convergence, efficient, rely on momentary measurements
▪ Disadvantages: can only be used offline, can get stuck in a local maximum or minimum

Probabilistic methods:
▪ Benefits: can be used in real time; consider history, entire probability densities and multi-modal distributions; weaker requirements on the accuracy of sensors/models; seamless integration of constraints/models; account for uncertainty
▪ Disadvantages: slower; require integration over the state densities

Deterministic models:
▪ Object movements are assumed to follow a trajectory prototype, which can be learned offline or defined by a model
▪ The trajectory is determined by an energy function in which each term has a different weight that has to be learned during training
▪ The energy function must be convex to be correctly optimized (unique local maximum or minimum); the optimization procedure should minimize the energy
▪ The approach can give good results even for poor images, but it is not well suited for complex movements

Probabilistic models:
▪ Bayes filtering algorithms estimate the posterior belief over the state based on control data and sensor measurements; the information is represented as a probability density function
▪ Recursively: predict with the motion model, receive information from the sensors (e.g. images), correct the prediction

Main components:
▪ State: contains all the information that needs to be known about the system; it is considered complete when it is the best predictor of the future (Markov property)
▪ Measurements: provide information about the state (e.g. images)
▪ Control inputs: change the dynamic system
▪ Models: object (temporal dynamics), sensors (characteristics), context (background modelling)

Inference:
▪ Bayesian inference: estimate the state x given measurement data z, control data u, and the respective state transition and measurement probabilities
▪ Belief: estimate of the true state

Recursive Bayes filter algorithm: the prediction is corrected at each frame and the model is updated
▪ Prediction/time update: calculate the prior belief from the dynamic model
▪ Correction/measurement update: calculate the posterior belief based on the measurement model; update the belief based on the new measurement and incorporate it using Bayes' theorem

Regular Kalman filter: used for linear (regular) dynamic models (e.g. airplanes)
Extended Kalman filter: non-linear models are assumed
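A minimal predict/correct Kalman filter sketch for a constant-velocity 2-D motion model, as it might be used to propagate a track between frames; the matrices and noise levels are generic textbook choices, not values from the lecture.

```python
# Linear Kalman filter, constant-velocity model: state x = [px, py, vx, vy],
# measurement z = [px, py] (e.g. the centre of a detected bounding box).
import numpy as np

dt = 1.0                                   # one frame between updates
F = np.array([[1, 0, dt, 0],               # state transition (motion model)
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # measurement model: we observe position only
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                       # process noise
R = 1.0 * np.eye(2)                        # measurement noise

x = np.zeros(4)                            # initial belief (state mean)
P = 10.0 * np.eye(4)                       # initial uncertainty (covariance)

for z in [np.array([1.0, 0.5]), np.array([2.1, 1.0]), np.array([2.9, 1.6])]:
    # Prediction / time update: prior belief from the dynamic model
    x = F @ x
    P = F @ P @ F.T + Q
    # Correction / measurement update: posterior belief from the new measurement
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P

print(x[:2], x[2:])                        # estimated position and velocity
```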
Appearance modelling:
▪ The appearance of the object needs to be modelled in such a way that it is detectable in different frames
▪ The representation needs to be flexible enough to cope with different scales, poses, illuminations and occlusions
▪ Object approximation: segmentation, polygonal approximation, bounding ellipse/box, position only
▪ Goal: measure affinity across different frames

Appearance tracking methods:
1) Lucas-Kanade tracker:
  o Minimizes the sum of squared differences via gradient descent
  o Different weights are assigned to the pixels of the patch, with the pixels in the middle having more weight
2) Mean shift:
  o Built from the colour histogram of a spatial template
  o Aims to maximize a similarity coefficient between histograms by iteratively applying the mean shift: the offset between the window centre and the weighted mean of the samples
  o The density is given by a certain colour
  o Not really used any more
3) Correlation filter:
  o Better computational speed (faster)
  o Uses the Fast Fourier Transform to determine the position of the template patch
  o Optimizes a distance metric between an ideal desired correlation output for an input image and the real correlation output of the training images with the filter template
  o The target is tracked over a search window in the next frame: the new position of the target is where the correlation output is maximal
  o The template of the target object is compared to different regions to find the best match: high response for the target and low response for the background
  o MOSSE finds the filter H that minimizes the sum of squared errors between the actual output of the convolution and the desired output of the convolution
4) Feature representation:
  o Features are extracted from an image window
  1. Learning phase: learn how to recognize the object; discriminative features are needed
  2. Detection phase: given the object description, the algorithm searches for the same object in the scene captured by the following frames
  3. Update: the descriptor is updated using the most recent frame
  o A CNN can be used to extract the features
5) DeepSORT:
  o Uses a CNN to improve the ability to track objects in the presence of occlusion
  o Evolution of SORT, which first uses a Kalman filter in image space and then performs frame-by-frame data association using the Hungarian method with a metric that measures bounding-box overlap
  o DeepSORT replaces the association metric with a more informed metric that combines motion and appearance information by defining a deep appearance descriptor
  o Limited number of features
  o Hungarian algorithm: matching based on a cost matrix that considers the Mahalanobis distance for motion consistency and the cosine distance for appearance similarity; an effective combinatorial optimization algorithm
  o Able to cope with occlusions and crowded areas; deep learning ensures that what is identified is correct (makes tracking more reliable)
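The DeepSORT-style association step can be sketched with SciPy's linear_sum_assignment (an implementation of the Hungarian method); the cost matrix below, mixing a motion term and an appearance term with a weight lam, is a made-up illustration.

```python
# Assigning detections to existing tracks with the Hungarian method.
# The cost mixes a motion term (e.g. Mahalanobis distance) and an appearance term
# (e.g. cosine distance between deep descriptors); the numbers are invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

motion_cost = np.array([[0.5, 4.0, 6.0],      # rows: tracks, cols: detections
                        [3.5, 0.7, 5.0],
                        [6.0, 4.5, 0.9]])
appearance_cost = np.array([[0.1, 0.8, 0.9],
                            [0.7, 0.2, 0.8],
                            [0.9, 0.9, 0.15]])

lam = 0.5                                      # weighting between the two cues
cost = lam * motion_cost + (1 - lam) * appearance_cost

track_idx, det_idx = linear_sum_assignment(cost)   # optimal one-to-one matching
for t, d in zip(track_idx, det_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.2f})")
```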
Structure from motion and visual odometry
▪ Structure-from-motion problem: figuring out how the camera moved through space and creating the 3D structure of the scene
▪ Visual odometry: estimating the position of a camera by looking at the changes caused by its motion, i.e. by tracking different features over time

3D Stereo Reconstruction
▪ Must be done in real time
▪ Stereo cameras can deliver scaled scenes: they have two lenses, which allows them to mimic the human eye by taking two slightly different pictures; this way perspective (depth) can be recovered
▪ A mono camera gives no scaled scenes
▪ Why is stereo reconstruction needed?
  o Avoiding collisions of the robot with objects
  o Better understanding the semantics of the scene
  o 3D mapping of the environment

Pinhole camera model:
▪ Mathematical relationship between a 3D point and its projection onto the image plane
▪ When you make a very small hole in a box, the outside world is projected onto the back of the box; hence "pinhole" camera

Epipolar geometry:
▪ Epipolar plane: the centres of the two cameras plus a point in the real world
▪ Conjugate points: the points in the image planes that the real-world point projects to in each camera (p1, p2)
▪ Epipoles: the projection of one camera into the other camera's image; all epipolar lines go through these points
▪ Epipolar lines: a point in one image corresponds to a line in the other (because it is seen from a different perspective); the epipolar plane is also defined by p1, e1 and the camera centre
▪ Epipolar constraint: a point in one image must lie on a line in the other image, so its concrete position is not directly known
▪ With fundamental matrix F and image point p (in homogeneous coordinates), the epipolar line of p in the other image is Fp; the epipolar line associated with a point q in the second image is F^T q
▪ Fundamental matrix: describes the epipolar geometry of two views (3x3 matrix); the epipolar geometry is mathematically described using the fundamental matrix
▪ Homogeneous coordinates: a point in n-dimensional space is represented by n + 1 coordinates
▪ Through triangulation, points on the epipolar plane can be found
▪ Rectification: adjusting/resampling the images so that the epipolar lines are horizontal makes the search process a lot easier, since it becomes a 1D search

Disparity map:
▪ Apparent motion of objects between the two stereo images (alternately opening one eye and closing the other shows disparity), i.e. the apparent pixel difference or motion between a pair of rectified stereo images
▪ Depth is given by the disparity
▪ A 2D image where each pixel represents the disparity between corresponding points of the two images of a stereo pair

Image matching
▪ Image matching algorithms identify and match features such as edges, points and patterns between two or more images
▪ Correspondences of the same point in image space allow its position in object space to be determined
▪ Feature matching: matching distinct features; used for image orientation
▪ Area-based matching: used for dense point cloud generation

Feature matching:
▪ Matching is mostly done only at salient points (corners, blobs, …)
▪ Identifying a corner:
  o Identified based on gradients (directional change in the intensity of the image)
  o Two gradients with perpendicular directions are needed to detect a corner; a single one can only detect an edge
▪ Identifying blobs:
  o Regions with a positive or negative colour value compared to their neighbourhood (e.g. a dark spot)
  1) Filter the image to represent different scales (Gaussian blurring; scale-space representation)
  2) Subtract the filtered image at one scale from the filtered image at the previous scale
  3) Check for local extrema across scales

Area-based matching:
▪ Cross-correlation: determine the position in the search image with the highest similarity to a reference pattern (ranges from -1 to 1; the higher, the more similar)
▪ Repeated for all pixels to get a dense reconstruction (a brute-force example follows below)
▪ Problem with correlation: the maximum correlation is not necessarily the correct match because of noise and occlusions; in textureless areas the best match is arbitrary, so considering only cross-correlation is not very accurate
▪ Solution: also consider neighbouring points, i.e. look for consistency of each point location with respect to its neighbours

Local vs. global matching:
▪ Local matching: only considers a small part of the image (window-based algorithms; correlation)
▪ Global matching: considers a large area of the image; estimates the disparity such that a global energy function is minimized; more accurate and robust, better suited for complex scenes; 2D optimization over the whole image
▪ Energy function: E(d) = E_data(d) + E_smooth(d), where d is the disparity image
  o E_data: total aggregated matching cost given disparity d (e.g. from cross-correlation); higher correlation leads to lower E_data
  o E_smooth: energy that encodes smoothness (e.g. a penalty for disparity changes in a close neighbourhood); it is lower when neighbouring disparities are similar, so it discourages large jumps
  o Goal: find the disparity image that minimizes the energy function

Semi-global matching:
▪ Faster; a solution considering only a limited number of directions (1D optimizations)
▪ Steps:
  1) Compute the data term for all disparities and all image pixels
  2) Consider different directions for each pixel
  3) Compute the energy functions for all combinations, adding the smoothing term (search for where E is minimal along 16 directions)
  4) Choose the best disparity considering the minimum cost over the different directions
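A brute-force window-based (local) matching sketch: for each pixel in the left image, search a disparity range along the same row of the right image using the sum of absolute differences, then convert disparity to depth via depth = focal_length * baseline / disparity; the images, focal length and baseline are synthetic placeholders.

```python
# Naive local stereo matching: SAD cost over a window, winner-takes-all disparity,
# then depth = focal_length * baseline / disparity (rectified images assumed).
import numpy as np

def block_matching(left, right, max_disp=16, half_win=3):
    h, w = left.shape
    disparity = np.zeros((h, w), dtype=float)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            patch = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
            costs = [np.abs(patch - right[y - half_win:y + half_win + 1,
                                          x - d - half_win:x - d + half_win + 1]).sum()
                     for d in range(max_disp)]
            disparity[y, x] = np.argmin(costs)   # 1D search along the epipolar line
    return disparity

rng = np.random.default_rng(0)
right = rng.random((40, 60))
left = np.roll(right, 5, axis=1)            # synthetic pair: left shifted by 5 pixels

disp = block_matching(left, right)
focal_length, baseline = 700.0, 0.12        # pixels, metres (placeholder values)
depth = focal_length * baseline / np.maximum(disp, 1e-6)
print(np.median(disp[4:-4, 21:-4]))         # ~5 for this synthetic example
```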
Deep learning and 3D reconstruction
▪ Replace different parts of the pipeline with neural networks, e.g. cross-correlation with a CNN
▪ Deep feature matching: two image patches are converted to vectors that can be compared
▪ Siamese architecture: two streams; similarity can be defined by a traditional method (feature matching) or by a fully connected decision network
▪ Variants have been developed that can consider global context without increasing the computational load: multi-scale approaches, attention layers
▪ Sparse reconstruction: every correctly matched point leads to a single point in 3D
▪ Several forward passes are needed to select the right patch (first branch: patch around the pixel; second branch: patches over the possible disparities)

Training procedures, two main options:
1) Supervised training:
  a. Take one example at a time and adapt the measure
  b. Consider positive and negative patches and use false positives to improve performance
  c. Requires a large labelled training set, which is tedious to obtain for real applications
  d. Minimize a matching loss: measures the discrepancy between ground truth and prediction for each sample
2) Weakly supervised learning:
  a. Exploits one or more stereo constraints to reduce manual labelling:
    i. Epipolar constraint
    ii. Disparity range constraint
    iii. Uniqueness constraint
    iv. Ordering constraint
  o Multiple-instance learning: a positive match must be better than the best negative patch
  o Contrastive loss: the best match must be much greater than the second-best match

End-to-end 3D reconstruction:
▪ Used for dense reconstruction: by stacking the two images, a depth map is inferred directly
▪ A huge amount of training samples is needed
▪ Other approaches divide the work into the steps of traditional matching:
  1) Feature learning
  2) Cost volume estimation and regularization (the cost volume can be 3D or 4D: row, column, feature dimension, disparity)
    i. 3D: regularization using 2D or 3D convolutions; for each pixel in the reference image (e.g. the left image), compute the cost of matching it to pixels at various disparities in the target image (e.g. the right image); costs are computed using similarity measures such as the sum of absolute differences (SAD), the sum of squared differences (SSD), or more complex functions; 3D cost volumes are used for stereo vision
    ii. 4D: used for multiple views, not just one stereo pair (no info about feature similarities)
  3) Disparity estimation: the disparity map is estimated from the regularized cost volume using a pixel-wise argmin
  4) Post-processing and refinement
  5) Learning confidence and uncertainty

Convolution: a process where two functions are combined into a third one; a filter/kernel is applied to an image to produce a new image where each pixel value is a weighted sum of its neighbours

Noise reduction: the lack of training data is a big challenge; the quality of a 3D reconstruction can be checked by swapping the left and right images

Loss functions:
▪ Supervised methods: try to minimize the difference between ground-truth and estimated depths
▪ Self-supervised methods: rely on image reconstruction losses; photometric map: learned feature map / image gradient
▪ Weakly supervised methods: rely on a confidence-guided loss; use left-right consistency to assess the confidence of the reconstruction

Single image depth estimation (SIDE)

Why SIDE?
▪ Goal: retrieve depth information from a single image; used for monocular cameras
▪ Dynamic environments: the scene changes from t to t+1
▪ Artificial illumination in dark environments
▪ Applications: obstacle avoidance, robot navigation

SIDE using CNNs:
▪ Input: an RGB image with depth information (or, for self-supervised training, the image orientation of additional views); output: the estimated depth
▪ A large training dataset is needed
▪ One-to-many mapping problem: several depth maps could explain the same image

Cues humans use to estimate depth from a single image:
▪ Occlusion
▪ Perspective
▪ Atmospheric cue (objects further away get blurry)
▪ Patterns of light and shadow
▪ Height cue

Important elements of traditional methods:
▪ Working in log space: smaller depths are estimated more accurately than larger depths
▪ Using spatial coordinates
▪ Incorporating global context: objects with a known size can help to estimate the scale of the scene
▪ Using relative instead of absolute depth
▪ Training with synthetic data can be very useful

First successful method to determine absolute depth: three sets of parameters have to be learned per row of the image, because each row is statistically different from the others:
1) Estimate the absolute depth of a patch by looking at handcrafted features
2) Estimate the uncertainty of the absolute depth estimation
3) A smoothing term in the model

Another method uses superpixels: the depth of all pixels belonging to a surface can be calculated from the 3D location of that surface

SIDE as a supervised regression problem: stack a coarse network and a fine network on top of each other
▪ Coarse network: uses the global context of the image (fully connected layers)
▪ Fine network: considers local parts, starting from the coarse result

Other approaches treat it as a classification problem: discretize the depth space and produce a confidence for each depth estimate

Ordinal regression problem: multiple heads, where each head solves a pixel-wise binary classification that decides whether a pixel is closer or further than a threshold; a feature extractor is followed by a scene-understanding module, and the feature maps from the scene-understanding module are fed into the heads of the network for the final prediction

The SIDE problem can also be tackled jointly with similar tasks

Limitations:
▪ Low transferability of the model to other locations
▪ Less accurate than stereo
▪ Complicated methods

Conclusions after studying such networks in autonomous driving:
▪ Vertical cues are more important than size
▪ Mimic changes of the camera pitch (using different vertical offsets)
▪ Objects need to be connected to the ground to be detected; colour is not relevant
▪ The network looks at relevant image edges
▪ The vanishing point and perspective are relevant
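Since the notes stress working in log space and minimizing the difference between ground-truth and estimated depths, here is a small sketch of common depth-error measures (RMSE, log-space RMSE and a scale-invariant log error); the arrays are synthetic and the exact metric definitions are an assumption in the spirit of commonly used depth benchmarks, not taken from the lecture.

```python
# Depth-error measures: RMSE, RMSE in log space, and a scale-invariant log error
# (log-space errors emphasise relative error, matching the note that smaller
# depths are estimated more accurately than larger ones).
import numpy as np

def depth_errors(pred, gt):
    d = np.log(pred) - np.log(gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean(d ** 2))
    scale_inv = np.mean(d ** 2) - np.mean(d) ** 2   # invariant to a global scale factor
    return rmse, rmse_log, scale_inv

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 50.0, size=(10, 10))          # synthetic ground-truth depths [m]
pred = gt * rng.normal(1.0, 0.1, size=gt.shape)     # noisy synthetic prediction
print(depth_errors(pred, gt))
```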
