Video Analysis Algorithms in Computer Vision
Document Details
Uploaded by MagicalRationality3124
Simanyene Secondary
Summary
This document provides an overview of various algorithms used in video analysis, a subfield of computer vision. It covers object tracking techniques, including optical flow, visual object tracking, and multiple object tracking, as well as action classification and pose estimation. The methods described draw on machine learning concepts and deep learning architectures.
Full Transcript
Video Analysis Algorithms in Computer Vision

Object tracking & video analysis

The task of video surveillance involves two kinds of algorithms:
1. Object tracking
◦ Optical Flow
◦ Visual Object Tracking (VOT)
◦ Multiple Object Tracking (MOT)
2. Action classification
◦ Action Classification with Machine Learning (End-To-End)
◦ Pose Estimation

Object Tracking

A video is a set of frames. It can be:
◦ a video stream (a live image feed)
◦ a video sequence (a fixed-length video)

Videos take up a lot of storage space and usually come without any AI-generated annotations, so with video we simply have raw image data to work with. The set of frames is needed to characterise motion. Motion is the only difference between an image and a video, and it is a powerful thing to track: it can lead to action understanding, pose estimation, or movement tracking.

Optical Flow

In video analysis, a key problem is optical flow estimation. Optical flow is the idea of computing the pixel shift between two frames. This is handled as a correspondence problem: the output is a vector field of movement between frame 1 and frame 2.

Several existing datasets address the optical flow problem, such as the KITTI Vision Benchmark Suite or MPI Sintel. Both contain ground-truth optical flow data, which is generally hard to obtain in more popular datasets.

To solve the optical flow problem, convolutional neural networks can help.

CNN - FlowNet

FlowNet is an example of a CNN designed for optical flow tasks: it outputs the optical flow from two input frames. Optical flow is often represented by colours. Example: https://www.youtube.com/watch?v=k_wkDLJ8lJE

Visual Object Tracking (VOT)

Visual Object Tracking (VOT) means:
▪ tracking an object given its position in frame 1
▪ no detection algorithm is used: the tracker is model-free and simply tracks the moving object

This is a very powerful technique, and it only uses computer vision.
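The pixel-shift idea behind optical flow can be sketched with a toy block-matching estimator: take a patch from frame 1 and search frame 2 for the displacement that matches it best. This is a minimal illustration of the correspondence problem, not how FlowNet works (FlowNet learns the flow with a CNN); all function names here are my own.

```python
# Toy optical flow via block matching: estimate the (dx, dy) shift of a
# small patch between two frames by minimising the sum of absolute
# differences (SAD).

def patch(frame, x, y, size):
    """Extract a size x size patch whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]

def sad(p, q):
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(a - b) for pr, qr in zip(p, q) for a, b in zip(pr, qr))

def block_match(frame1, frame2, x, y, size, radius):
    """Search a (2*radius+1)^2 window in frame2 for the best match of the
    patch at (x, y) in frame1; return the displacement vector (dx, dy)."""
    template = patch(frame1, x, y, size)
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or ny + size > len(frame2) or nx + size > len(frame2[0]):
                continue  # candidate patch would fall outside the frame
            cost = sad(template, patch(frame2, nx, ny, size))
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best

def make_frame(bx, by, w=8, h=8):
    """Black frame with a bright 2x2 blob at (bx, by)."""
    f = [[0] * w for _ in range(h)]
    for j in range(2):
        for i in range(2):
            f[by + j][bx + i] = 255
    return f

# The blob moves from (1, 1) to (3, 2) between the two frames.
frame1, frame2 = make_frame(1, 1), make_frame(3, 2)
print(block_match(frame1, frame2, 1, 1, 2, 3))  # (2, 1): the flow vector
```

Dense optical flow repeats this kind of estimate for every pixel, which is why learned methods like FlowNet are attractive in practice.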
We don't even need a neural network to do this. To summarise the process:
◦ We receive the initial object to track as a bounding box
◦ We compute a colour histogram of this object
◦ We compute a colour histogram of the background (near the object)
◦ We subtract the background colours, keeping only the colours that are distinctive of the object
◦ We now have a colour-based object tracker

The next step is to apply CNNs to this task. We must distinguish two main models here: MDNet and GOTURN.
◦ An MDNet (Multi-Domain Net) tracker trains a neural network to distinguish between an object and the background. See https://www.youtube.com/watch?v=zYM7G5qd090: the tracker separates objects from the background by drawing bounding boxes around them.
◦ GOTURN (Generic Object Tracking Using Regression Networks) uses two neural networks and a specified region to search. It can run at over 100 FPS, which is remarkable for a video-tracking task. See https://www.youtube.com/watch?v=MMQzQW1Y4h0: the usage is similar to MDNet; only the underlying approach differs.

Multiple Object Tracking (MOT)

The last family of trackers is referred to as multiple object tracking. Unlike VOT, MOT is more long-term. We distinguish two variants:
◦ Detection-Based Tracking (we know what we are tracking)
◦ Detection-Free Tracking (we don't know what we are tracking)

Let's consider Detection-Based Tracking. We have two tasks here:
◦ Object detection
◦ Object association: detections produced by an external detector are matched with objects that have been tracked in previous frames.

Action Classification

Action classification is the second family of tasks involved in building computer vision-based surveillance systems. Once we know how many people are in the store and where they have been, we must analyze their actions.
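The colour-histogram tracker summarised above can be sketched on a toy frame whose pixels are small integer "colour bins" instead of RGB values. This is my own minimal reading of the steps (object histogram, background histogram, subtraction, then a back-projection-style search); real implementations typically use OpenCV histogram back-projection with MeanShift or CamShift.

```python
# Sketch of a colour-based object tracker: build a colour model of the
# object, suppress background colours, then find the window in a later
# frame that scores highest under that model.

from collections import Counter

def region(frame, x, y, w, h):
    """Flat list of the pixel values inside a w x h box at (x, y)."""
    return [frame[j][i] for j in range(y, y + h) for i in range(x, x + w)]

def object_model(frame, box):
    """Histogram of the object minus the histogram of its surroundings,
    leaving only colours that are distinctive of the object."""
    x, y, w, h = box
    obj = Counter(region(frame, x, y, w, h))
    bg = Counter(region(frame, max(0, x - 1), max(0, y - 1), w + 2, h + 2))
    bg.subtract(obj)  # the ring around the box, object pixels removed
    return {c: n for c, n in obj.items() if n > bg.get(c, 0)}

def locate(frame, model, w, h):
    """Back-projection-style step: slide a w x h window and return the
    position scoring highest under the object's colour model."""
    best, best_score = (0, 0), -1
    for y in range(len(frame) - h + 1):
        for x in range(len(frame[0]) - w + 1):
            score = sum(model.get(c, 0) for c in region(frame, x, y, w, h))
            if score > best_score:
                best_score, best = score, (x, y)
    return best

# A 2x2 object of colour 7 on a colour-0 background moves between frames.
frame1 = [[0] * 6 for _ in range(6)]
frame2 = [[0] * 6 for _ in range(6)]
for j in (1, 2):
    for i in (1, 2):
        frame1[j][i] = 7
for j in (3, 4):
    for i in (4, 5):
        frame2[j][i] = 7

model = object_model(frame1, (1, 1, 2, 2))
print(locate(frame2, model, 2, 2))  # (4, 3): the object's new position
```

Because the model discards colours shared with the background, the tracker follows the object rather than lighting or background clutter, which is the key idea of the histogram approach.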
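The object-association step of detection-based tracking can be sketched as matching by bounding-box overlap (intersection-over-union). This greedy version is an illustrative assumption on my part; production trackers such as SORT use the Hungarian algorithm plus a motion model, but the matching idea is the same.

```python
# Sketch of object association for MOT: greedily pair new detections with
# existing tracks by best IoU, skipping pairs below a threshold.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, threshold=0.3):
    """Return {track_id: detection_index} by greedy best-IoU matching."""
    pairs = sorted(
        ((iou(t, d), tid, di)
         for tid, t in tracks.items()
         for di, d in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = {}, set(), set()
    for score, tid, di in pairs:
        if score < threshold or tid in used_t or di in used_d:
            continue  # too weak, or one side is already matched
        matches[tid] = di
        used_t.add(tid)
        used_d.add(di)
    return matches

tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}  # boxes from frame t-1
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]    # detector output, frame t
print(associate(tracks, detections))  # track 2 -> detection 0, track 1 -> detection 1
```

Unmatched detections would start new tracks, and tracks that go unmatched for several frames would be dropped; that bookkeeping is what makes MOT "long-term".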
Action classification depends directly on object detection and tracking: we first need to understand a given situation or scene, and only then can we classify the actions inside the bounding box.

First, we must choose the camera that sees the subjects from the best angle, since some angles are better than others. If we always choose the correct camera (for example, the one that shows a face), we can be sure we have a workable image.

Actions can be really simple, like walking, running, clapping, or waving. They can also be more complex, like making a sandwich, which involves a series of multiple actions (cutting bread, washing tomatoes, etc.).

Action Classification with Machine Learning (End-To-End)

An action happens in a video, not in a single image. This means we must send multiple frames to the CNN, which then performs a classification task on what is called a space-time volume. With a single image, object detection or classification is already hard (image size, rotation, etc.); in a video, it is even more difficult.
1. A video can be decomposed into spatial and temporal components.
2. The spatial part, in the form of individual frame appearance, carries information about scenes and objects.
3. The temporal part, in the form of motion across frames, conveys the movement of the observer (the camera) and of the objects.

Pose Estimation

Pose estimation is another deep learning technique used as a means for action classification. The process of pose estimation includes:
◦ Detecting keypoints (similar to facial landmarks)
◦ Tracking these keypoints
◦ Classifying the keypoints' movement
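The space-time volume mentioned in the end-to-end action-classification section is simply a stack of consecutive frames: spatial dimensions (height, width) plus a temporal dimension. A minimal sketch, without any deep-learning library (function names are my own):

```python
# A space-time volume is T frames stacked along a time axis; the temporal
# component of the signal is the per-pixel change between frames.

def space_time_volume(frames):
    """Stack T equally sized frames into a volume indexed volume[t][y][x]."""
    h, w = len(frames[0]), len(frames[0][0])
    assert all(len(f) == h and len(f[0]) == w for f in frames)
    return frames

def temporal_difference(volume):
    """Crude 'temporal part': absolute per-pixel change between consecutive
    frames, the raw motion signal that action classifiers build on."""
    return [
        [[abs(volume[t + 1][y][x] - volume[t][y][x])
          for x in range(len(volume[t][0]))]
         for y in range(len(volume[t]))]
        for t in range(len(volume) - 1)
    ]

# Two 2x2 frames where one pixel brightens: the temporal part isolates it.
vol = space_time_volume([[[0, 0], [0, 0]],
                         [[0, 9], [0, 0]]])
print(temporal_difference(vol))  # [[[0, 9], [0, 0]]]
```

An end-to-end model consumes the whole volume at once (for example with 3D convolutions), so the spatial and temporal parts described above are learned jointly rather than hand-separated as here.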
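The three pose-estimation steps above (detect, track, classify keypoint movement) can be sketched end to end on a single keypoint. The classification rule here (horizontal oscillation means "waving") is a deliberately simplistic assumption of mine; real systems feed whole keypoint sequences to a learned classifier.

```python
# Sketch of keypoint-movement classification: given one keypoint's (x, y)
# positions across frames, derive displacement vectors and apply a rule.

def displacements(track):
    """Frame-to-frame (dx, dy) vectors for one keypoint track."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(track, track[1:])]

def classify(track):
    """Toy action label from a keypoint track (illustrative rule only)."""
    moves = displacements(track)
    if all(dx == 0 and dy == 0 for dx, dy in moves):
        return "idle"
    # Count horizontal direction reversals: oscillation suggests waving.
    signs = [1 if dx > 0 else -1 for dx, dy in moves if dx != 0]
    reversals = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return "waving" if reversals >= 2 else "moving"

wrist = [(10, 5), (13, 5), (10, 5), (13, 5), (10, 5)]  # x oscillates
print(classify(wrist))  # "waving"
```

The appeal of the pose-based route is exactly this compression: instead of classifying raw space-time volumes, the model reasons over a handful of keypoint trajectories.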