Computer Vision: A Practical Introduction PDF
Related
- Week 2 - Unit I - Image Fundamentals - II PDF
- Digital Image Processing Course Study Guide PDF
- Introduction to Computer Vision - Fall 2024 - Toronto Metropolitan University PDF
Summary
This document provides a practical introduction to computer vision, focusing on the fundamentals of digital image processing. It details image representation, types, and processing techniques. The document emphasizes various stages of image processing, including acquisition, enhancement, and analysis.
Full Transcript
COMPUTER VISION: A PRACTICAL INTRODUCTION TO COMPUTER VISION WITH OPENCV

Digital Image Processing
Binary images are represented on a computer using zeros and ones. Each image consists of rows and columns of pixels, where a pixel is the smallest unit of the image; the clarity of the image increases with the number of pixels. Digital image processing means processing a digital image by means of a digital computer. We can also say that it is the use of computer algorithms to obtain an enhanced image or to extract useful information from it. Digital image processing is the use of algorithms and mathematical models to process and analyze digital images.

The basic steps involved in digital image processing are:
1. Image acquisition: capturing an image using a digital camera or scanner, or importing an existing image into a computer.
2. Image enhancement: improving the visual quality of an image, such as increasing contrast, reducing noise, and removing artifacts.
3. Image restoration: removing degradation from an image, such as blurring, noise, and distortion.
4. Image segmentation: dividing an image into regions or segments, each of which corresponds to a specific object or feature in the image.
5. Image representation and description: representing an image in a way that can be analyzed and manipulated by a computer, and describing the features of an image in a compact and meaningful way.
6. Image analysis: using algorithms and mathematical models to extract information from an image, such as recognizing objects, detecting patterns, and quantifying features.
7. Image synthesis and compression: generating new images or compressing existing images to reduce storage and transmission requirements.
Digital image processing is widely used in a variety of applications, including medical imaging, remote sensing, computer vision, and multimedia.

Image processing mainly includes the following steps:
1. Importing the image via image acquisition tools;
2. Analyzing and manipulating the image;
3. Producing output, which can be an altered image or a report based on the analysis of that image.

What is an image?
An image is defined as a two-dimensional function F(x, y), where x and y are spatial coordinates, and the amplitude of F at any pair of coordinates (x, y) is called the intensity of the image at that point. When x, y, and the amplitude values of F are all finite, we call it a digital image. In other words, an image can be defined as a two-dimensional array arranged in rows and columns. A digital image is composed of a finite number of elements, each of which has a particular value at a particular location. These elements are referred to as picture elements, image elements, or pixels; pixel is the term most widely used to denote the elements of a digital image.
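To make the idea of an image as a two-dimensional array of intensity values concrete, here is a minimal NumPy sketch (not taken from the original text; the pixel values are arbitrary illustrative numbers):

```python
import numpy as np

# A tiny 3x3 grayscale "image": each element is a pixel intensity in 0..255.
image = np.array([[40, 20, 10],
                  [70, 50, 30],
                  [90, 80, 10]], dtype=np.uint8)

rows, cols = image.shape        # the spatial extent of the image
print(rows, cols)               # 3 3
print(image[1, 2])              # intensity at row 1, column 2 -> 30

# A binary image uses only 0 (black) and 1 (white).
binary = (image > 50).astype(np.uint8)
print(binary)
```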
Types of an image
1. BINARY IMAGE – As its name suggests, a binary image contains only two pixel values, 0 and 1, where 0 refers to black and 1 refers to white. This image is also known as a monochrome image.
2. BLACK AND WHITE IMAGE – An image which consists of only black and white pixels is called a black and white image.
3. 8-BIT COLOR FORMAT – This is the most common image format. It has 256 different shades and is commonly known as the grayscale image. In this format, 0 stands for black, 255 stands for white, and 127 stands for gray.
4. 16-BIT COLOR FORMAT – This is a color image format with 65,536 different colors, also known as the high color format. In this format the distribution of values is not the same as in a grayscale image: each 16-bit value is divided into three parts for red, green, and blue, the familiar RGB format.

PHASES OF IMAGE PROCESSING
1. ACQUISITION – It can be as simple as being given an image that is already in digital form. The main work involves: a) scaling, b) color conversion (RGB to gray or vice versa); a code sketch of these two operations appears after the list of disadvantages below.
2. IMAGE ENHANCEMENT – It is among the simplest and most appealing areas of image processing. It is used to bring out detail that is hidden in an image, and it is subjective.
3. IMAGE RESTORATION – It also deals with improving the appearance of an image, but it is objective (restoration is based on a mathematical or probabilistic model of image degradation).
4. COLOR IMAGE PROCESSING – It deals with pseudo-color and full-color image processing; color models are applicable to digital image processing.
5. WAVELETS AND MULTI-RESOLUTION PROCESSING – It is the foundation for representing images at various degrees of resolution.
6. IMAGE COMPRESSION – It involves developing functions to reduce the amount of data needed to store an image; it mainly deals with image size or resolution.
7. MORPHOLOGICAL PROCESSING – It deals with tools for extracting image components that are useful in the representation and description of shape.
8. SEGMENTATION – It partitions an image into its constituent parts or objects. Autonomous segmentation is one of the most difficult tasks in image processing.
9. REPRESENTATION AND DESCRIPTION – It follows the output of the segmentation stage; choosing a representation is only part of the solution for transforming raw data into a processed form.

Advantages of digital image processing:
1. Improved image quality: digital image processing algorithms can improve the visual quality of images, making them clearer, sharper, and more informative.
2. Automated image-based tasks: digital image processing can automate many image-based tasks, such as object recognition, pattern detection, and measurement.
3. Increased efficiency: digital image processing algorithms can process images much faster than humans, making it possible to analyze large amounts of data in a short amount of time.
4. Increased accuracy: digital image processing algorithms can provide more accurate results than humans, especially for tasks that require precise measurements or quantitative analysis.

Disadvantages of digital image processing:
1. High computational cost: some digital image processing algorithms are computationally intensive and require significant computational resources.
2. Limited interpretability: some digital image processing algorithms may produce results that are difficult for humans to interpret, especially for complex or sophisticated algorithms.
3. Dependence on the quality of the input: the quality of the output of digital image processing algorithms is highly dependent on the quality of the input images; poor quality input images can result in poor quality output.
4. Limitations of algorithms: digital image processing algorithms have limitations, such as the difficulty of recognizing objects in cluttered or poorly lit scenes, or the inability to recognize objects with significant deformations or occlusions.
5. Dependence on good training data: the performance of many digital image processing algorithms depends on the quality of the training data used to develop them; poor quality training data can result in poor performance of the algorithm.
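The acquisition phase above mentions scaling and RGB-to-gray conversion as typical first operations. A minimal OpenCV sketch of those two operations, assuming a hypothetical local file named input.jpg:

```python
import cv2

# Acquisition: import an existing image (hypothetical filename).
bgr = cv2.imread("input.jpg")            # OpenCV loads color images in BGR order
if bgr is None:
    raise FileNotFoundError("input.jpg not found")

# a) Scaling: resize to half the original width and height.
half = cv2.resize(bgr, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)

# b) Color conversion: BGR color image to grayscale.
gray = cv2.cvtColor(half, cv2.COLOR_BGR2GRAY)

cv2.imwrite("gray_half.jpg", gray)
```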
Difference between Image Processing and Computer Vision
Image processing and computer vision are both very exciting fields of computer science.
Computer vision: in computer vision, computers or machines are made to gain a high-level understanding of input digital images or videos, with the purpose of automating tasks that the human visual system can do. It uses many techniques, and image processing is just one of them.
Image processing: image processing is the field of enhancing images by tuning many parameters and features of the images; image processing is therefore a subset of computer vision. Here, transformations are applied to an input image and the resulting output image is returned. Some of these transformations are sharpening, smoothing, stretching, etc.
Since both fields deal with visuals, images, and videos, there is often confusion about the difference between these fields of computer science, which is why the distinction is outlined above.

The stages of image analysis processing can be outlined as follows:
1. Pre-processing: this stage is used to identify and remove noise (such as dots, speckles, and scratches) and irrelevant visual information that does not affect the regions to be processed later.
2. Data reduction: this stage is used to reduce the data, either in the spatial domain or by transforming it to the frequency domain, and to record the properties (of the frequency or spatial domain) needed for the subsequent analysis.

Primary processing is divided into sections:
1. Region-of-interest (ROI) processing: the derived features of a specific region, called the ROI, are used, and operations defined by spatial coordinates are applied to it, including cropping and zooming for enlargement, reduction, translation, or rotation. A partial image is then obtained for further processing.

Zoom process methods:
1. The first method is the zero-order-hold method, which involves repeating the pixel values of rows and columns, for example repeating each row to zoom along the rows, or repeating rows and columns simultaneously to enlarge the matrix.

Example: You have the following part of the required image (a 3x3 matrix, recoverable from the solutions below):
40 20 10
70 50 30
90 80 10
1. Zoom it using the zero-order-hold method row by row.
2. Zoom it using the zero-order-hold method column by column.
3. Zoom it using the zero-order-hold method row and column.

Solution:
1. Repeating each pixel along its row gives a 3x6 matrix:
40 40 20 20 10 10
70 70 50 50 30 30
90 90 80 80 10 10
2. Repeating each row gives a 6x3 matrix:
40 20 10
40 20 10
70 50 30
70 50 30
90 80 10
90 80 10
3. Repeating in both directions gives a 6x6 matrix:
40 40 20 20 10 10
40 40 20 20 10 10
70 70 50 50 30 30
70 70 50 50 30 30
90 90 80 80 10 10
90 90 80 80 10 10
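The zero-order-hold zoom from the worked example above can be expressed with NumPy's repeat; this is an illustrative sketch, not code from the original document:

```python
import numpy as np

img = np.array([[40, 20, 10],
                [70, 50, 30],
                [90, 80, 10]])

# Row-by-row zoom: repeat each pixel along its row -> 3x6.
zoom_rows = np.repeat(img, 2, axis=1)

# Column-by-column zoom: repeat each row -> 6x3.
zoom_cols = np.repeat(img, 2, axis=0)

# Row-and-column zoom: repeat in both directions -> 6x6.
zoom_both = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

print(zoom_rows)
print(zoom_cols)
print(zoom_both)
```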
How to find the average: we find the average of two adjacent pixel values and insert it between them. For example, for 4 and 8 we add them (= 12) and divide by 2, so the average value is 6, and the result is written as 4 6 8. If we apply this method to the pixels of the rows, the number of columns increases, and if we apply it to the columns, the number of rows increases.

We can work on two pixels at a time in each row and each column, and we can expand the columns and rows together: this method enlarges an NxN matrix into an image matrix of size (2N-1)x(2N-1).
Example: we have a 3x3 matrix that represents part of the values of the digital image, and we need to expand the columns and rows together.
Solution: the size of the matrix becomes 5x5.
Example for clarification: we have the following matrix, whose rows and columns will be expanded together. Note that the column expansion is applied to the matrix resulting from the row expansion, not to the original.

4. Zoom using a factor (k): this means that the image (matrix) is enlarged, for example, to 3 times its size, i.e. the factor K = 3 determines how far the matrix is expanded. If what is required is to enlarge a matrix (part of an image) three or four times or by some other factor, we use the so-called k factor and do the following:
1. Take the difference between each pair of adjacent values.
2. Divide the result by the magnification factor (K).
3. Add the result to the smaller value, and keep adding it to produce the (K - 1) intermediate values.
4. Apply these steps to the rows and then to the columns.

Example: you have a portion of an image that you want to enlarge to 3 times its original size.
Solution: we take each pair of adjacent values, subtract the smaller from the larger, divide the result by 3, and add the result of the division to the smaller value. For the adjacent values 125 and 140, (140 - 125)/3 = 5; we add 5 twice, so two new numbers are produced between 125 and 140:
[125 130 135 140]
Then we take the next two adjacent numbers, which are 140 and 155, and the matrix becomes as follows.
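A sketch of the factor-K zoom described above: between each pair of adjacent values, K - 1 evenly spaced values are inserted, first along the rows and then along the columns of the result. The function names are my own; the numbers reproduce the 125/140/155 example:

```python
import numpy as np

def zoom_1d(row, k):
    """Insert k-1 linearly interpolated values between each pair of neighbours."""
    out = []
    for a, b in zip(row[:-1], row[1:]):
        step = (b - a) / k                          # difference divided by the factor K
        out.extend(a + step * i for i in range(k))  # a, a+step, ..., a+(k-1)*step
    out.append(row[-1])
    return np.array(out)

def zoom_factor_k(img, k):
    """Apply the interpolation to every row, then to every column of the result."""
    rows_done = np.array([zoom_1d(r, k) for r in img.astype(float)])
    return np.array([zoom_1d(c, k) for c in rows_done.T]).T

print(zoom_1d(np.array([125.0, 140.0, 155.0]), 3))
# [125. 130. 135. 140. 145. 150. 155.]

part2 = np.array([[125, 140], [155, 170]])
print(zoom_factor_k(part2, 3).shape)   # (4, 4): two new values between each pair of neighbours
```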
Computer vision modeling: Image Algebra
Algebraic operations are divided into arithmetic operations and logical operations.

Arithmetic operations:
Addition: the addition operation is used to combine the information from two images by adding the elements of the first image to those of the second, starting with the first element of the first image and the first element of the second image, and so on for the rest of the elements. We use addition for image restoration and for adding noise to an image (as a simple form of encryption).
Example: you have parts of the following two images, I1 and I2; add these two parts.
Solution:
Example: if addition was used to add noise, how can the two images be recovered?
Answer: we rely on the result: we subtract one of the two matrices from the result, and the matrix produced by the subtraction is the noise matrix.

Subtraction: the subtraction operation is used to find the difference between two images, by subtracting every element of the second image from the corresponding element of the first.
Example: you have the following two images; subtract the two images.
Solution:

Multiplication: the operation is done by multiplying the matrix elements of the image by a factor, and it is used to increase (brighten) or decrease (darken) the image values. The factor K must be greater than one when you want to brighten the image.
Example: you have the following image; scale it up and down using one of the digital image algebra operations.

Answer: using the multiplication operation:
1. To increase it.
2. To decrease it.
We use multiplication as an algebraic arithmetic operation; for example, we multiply this matrix by a factor (we choose the factor here, since it was not specified in the question). The coefficient K = 3 is greater than one, so it increases (brightens) the image; if the image is to be decreased (darkened), we multiply by a factor less than one, for example 1/3.

Example: we have a matrix (part of an image) which is the result of the increase operation with K = 3; find the original matrix.
Answer: there are two ways to solve this:
1. Divide the matrix by the factor K: dividing each value in the resulting matrix by the factor (3) produces the original matrix.
2. Treat it as a decrease, using a factor K less than one, in this case K = 1/3.
Note: K > 1 (increase) makes the image tend towards white; K < 1 (decrease, shrinking the values) makes the image tend towards black (darkness).

Division: the elements of the given image are divided by a factor greater than one. The division operation makes the image darker.
Example: you have the following matrix, which is part of an image; divide the image by a factor of K = 4.
Solution:

Logical operations:
Logical operations are applied to the elements of the image after converting each element to its binary form, so that the logical operations can be applied to a region of interest (ROI).
Logical AND operation: AND behaves like the multiplication operation; the image is ANDed with a mask containing a white square, so that the output is the part of the image corresponding to the white square. (For AND, the region we want to keep is white in the mask; for OR, the region we want to keep is black in the mask.)
Logical OR operation: it is done by taking a mask with a black square on a white background for the required image data from the original image; the OR operation is similar to the addition operation.
Logical NOT operation: it is used to produce the negative of the original image, i.e. it inverts the image (like a photographic negative): the image data is reversed, so black becomes white and white becomes black.
Example: if you have the following image part, apply NOT to it. The image resulting from the NOT operation is close to black, and the data of this image must first be converted to binary (0, 1) format.
Example: apply an AND operation between two image elements, where the first element is 88 and the second element is 111.
Solution: we convert 88 and 111 to binary form and AND them bit by bit. In the case of NOT, it is applied to a single number, so that every zero becomes a one and every one becomes a zero.
Note: the same approach of converting the values to binary is used for the other logical gates (NAND, NOR, XOR).
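A small NumPy sketch of the image-algebra operations above (addition of noise and its removal by subtraction, multiplication by a factor K, and the element-wise NOT and AND operations); all values are made up for illustration:

```python
import numpy as np

img   = np.array([[100, 150], [200,  50]], dtype=np.uint16)
noise = np.array([[  5,  10], [ 20,  15]], dtype=np.uint16)

# Addition: combine image and noise (a simple form of encryption).
noisy = img + noise
# Subtraction: recover the original by subtracting the noise from the result.
recovered = noisy - noise

# Multiplication: K > 1 brightens (tends to white), K < 1 darkens.
brighter = np.clip(img * 3, 0, 255)
darker   = (img * (1 / 3)).astype(np.uint16)

# Logical NOT: invert an 8-bit image (black becomes white and vice versa).
negative = 255 - img.astype(np.uint8)

# Logical AND of two pixel values, bit by bit (88 AND 111).
print(88 & 111, format(88 & 111, '08b'))   # 72 -> 01001000
```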
Image enhancement (spatial filters)
A filter is a process that cleans the image of any remaining impurities; that is, it highlights the features of the part of the image that we want by removing noise and impurities. We use spatial filters to remove noise or to improve the image. These filters are applied directly in the spatial domain (directly on the image elements) and not in the frequency domain, where the image elements would first be transformed using one of the transforms such as the Fourier transform or the cosine transform.

Filters are divided into three types:
1. Mean filter
2. Median filter
3. Enhancement filter
The first and second types are used to remove noise, in addition to some applications that smooth the image: 1) removing noise, 2) smoothing. The third type is used to sharpen the edges and details in the image. Spatial filters are applied either by using the image elements directly, without a mask, or by convolving a mask with the elements and their neighbours. The behaviour of a mask can be predicted from its coefficients as follows:
1. If the sum of the mask's coefficients equals 1, the overall illumination (brightness) of the image is preserved.
2. If the sum of the coefficients equals 0, the image loses its illumination, i.e. it tends to become black.
3. If the coefficients alternate between negative and positive values, the mask brings out information about the edges.
4. If the coefficients are all positive, the mask produces some smoothing (blurring) of the image.

1. Mean filter: it is a linear filter whose coefficients are all positive; because they are all positive the filter blurs (smooths) the image, and because the sum of the mask's coefficients equals 1 the overall brightness is preserved.
Example: apply the mean mask to the following image.
Solution: we know that the result is two points, so we connect them and the shape becomes linear.

2. Median filter: it is a non-linear filter that acts on the image elements directly after selecting a mask from the elements themselves, where the centre of the window is replaced by the median value.
Example: apply the median filter to the following image part.
There is no fixed mask in the median filter; we create one from the elements of the matrix, taking the image elements under the window and arranging them in ascending order, so that:
1. First step: arrange the elements in ascending order.
2. Second step: divide the number of elements by 2 to find the middle position.
3. Third step: read the value at the fifth position; the value of the fifth position is 5 (it is not necessary for this value to equal the element currently at the middle of the window, which is 4).
4. Fourth step: change the middle location in the matrix, i.e. the fifth location, so that it equals the value 5 instead of the middle element of the original matrix, which is 4 (that is, we put 5 instead of 4).

Example: you have the following image fragment.
Solution:
1. We take the 3x3 part and arrange its elements in ascending order.
2. We select the element in the middle.
3. We find the value in the middle, which is 5.
4. We replace the element in the middle and write the matrix.
We then take the next part of the matrix, which is also 3x3, and arrange it in ascending order.

Enhancement filter: used to sharpen the edges and details in the image, as noted above.
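A sketch of the mean and median filters described above, using OpenCV's built-in blur and medianBlur functions on a small test array (illustrative values, not the matrices from the original examples):

```python
import cv2
import numpy as np

img = np.array([[10, 10, 10, 10],
                [10, 90, 10, 10],   # the 90 acts like impulse ("salt") noise
                [10, 10, 10, 10],
                [10, 10, 10, 10]], dtype=np.uint8)

# Mean filter: a 3x3 mask whose coefficients are all 1/9 (they sum to 1).
mean_filtered = cv2.blur(img, (3, 3))

# Median filter: each pixel is replaced by the median of its 3x3 neighbourhood.
median_filtered = cv2.medianBlur(img, 3)

print(mean_filtered)
print(median_filtered)
```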
Image quantization:
The difference between compression and shrinking: image shrinking is the process of reducing the image data by removing some of the image information, projecting a group of image elements onto a single point; this process of shrinking is quantization. Compression, on the other hand, deals with the image itself as a file, while shrinking may delete part of the image and deals with the values of the image.
There are two ways to reduce the image:
1. Gray level reduction – we reduce the color (gray) levels of the image; this is done on the values I(r,c).
2. Spatial reduction – here the work is done on the coordinates (r, c) of the image elements, i.e. on their locations, for example (1,1).

1. Gray level reduction
Gray level reduction can be performed in three ways:

A. The first method: thresholding. A specific value of the gray levels is chosen; this value is called the threshold. Any image value higher than the threshold becomes one, and any value lower than it becomes zero. This means that an image with 256 gray levels is converted into a binary image.
Example: if the threshold value is 127, apply it to the following values.
Solution: we see that the highest value is 251 and the lowest value is 11; comparing each value with the threshold of 127 gives the binary result.
Example: if the threshold value is 127, apply it to the following values.
Solution: all of these values are less than the threshold of 127, so they would all become zero. In this case we instead determine the largest and smallest values and take the middle between them as the threshold: the smallest value is 2 and the largest value is 25, so the middle between them is 12 or 13.

B. The second method: the OR and AND operations without a mask. Here the number of bits per pixel is reduced.
Example: we want to reduce the information from the standard 256 gray levels to 32 levels; we use the AND method, which takes the smallest number in each cell.
Solution: the 256 color levels must be reduced to 32 levels, meaning every 8 levels are grouped into one cell, after which we take the lowest value in each cell; the first cell gives the lowest value 0, the second 8, the third 16, and so on up to 248, so that the number of extracted values is 32.
The OR method instead takes the largest number in each cell. Why? Because OR does not produce a zero; it takes the largest number from each cell, giving 7, 15, 23, and so on.
Example: if you have the standard number of levels (256) and you want to reduce it to 16?
Answer: this means that every 16 levels are grouped in one cell, so the division is by 16; the first cell runs from 0 to 15, and so on.

C. The third method: AND and OR using a mask. This method is used to shrink (reduce) the image using a specific mask.
Example: use the AND-mask method to shrink the following part of the image, given that the number of bits per element is 8.
Solution: by the rule of gray levels, a reduction to 8 levels means values from 0 to 7; in binary, 7 is 111, so the 8-bit mask equals 00000111. We then take each number from the matrix, convert it to binary form, and AND it with the mask. The number 0 in binary is 00000000, and after ANDing it with the mask it remains 00000000. The number 10, converted to binary, is 00001010; ANDing it with the mask gives 00000010 (= 2). As for the number 255, which is 11111111, ANDing it with the mask gives 00000111 (= 7). And so on for the rest of the numbers in the matrix.
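A sketch of the gray-level reduction methods above: thresholding at 127 (method A), AND-ing away the low three bits so that each group of eight levels maps to its lowest value (method B, 256 to 32 levels), and AND-ing with the 8-bit mask 00000111 exactly as in the worked example (method C). The sample values are illustrative:

```python
import numpy as np

values = np.array([0, 10, 127, 128, 200, 255], dtype=np.uint8)

# Method A: thresholding at 127 -> binary image (0 or 1).
binary = (values > 127).astype(np.uint8)

# Method B: 256 levels -> 32 levels; AND with 11111000 keeps the lowest
# value of each group of eight (0, 8, 16, ..., 248).
reduced_32 = values & 0b11111000

# Method C: AND with the mask 00000111, as in the worked example
# (10 -> 2, 255 -> 7).
masked = values & 0b00000111

print(binary, reduced_32, masked)
```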
2. Spatial reduction
Spatial reduction is done in three ways:
1. Averaging
2. Median
3. Reduction (decimation)

The first method: averaging. This method takes a group of adjacent elements and takes their average.
Example: use the averaging method on the following image part.
Solution: the overall average of the image is: total average of image = (33 + 17)/4 = 12.5, roughly 13. If the average is taken per row, it is the sum of the row divided by the number of elements in that row.

The second method: the median. In this method the image elements are arranged in ascending order and the value in the middle is taken.
Example: you have the following array. What is required:
1. Take the median of all elements.
2. Use a specific mask.
Solution:
- We arrange the numbers in ascending order.
- Without a mask, the median is the sixth element.
- If a mask is used, and the mask is assumed to be 3x3, we arrange the elements of each 3x3 window in ascending order. For the first 3x3 matrix the median is the fifth element and its value is 5. For the second 3x3 matrix, arranged in ascending order as well, the median is the fifth element and its value is also 5.

The third method: reduction (decimation). Some of the image data is deleted; for example, to reduce the image size by 2, every second row or column of the image is deleted.
Example: you have the following image part that needs to be reduced by 2 along its columns.
Solution: the second and fourth columns are deleted, so the matrix becomes as follows. If the reduction is by 3, we delete two columns, the second and the third (since 3 - 1 = 2), so the matrix becomes a single column sequence. In this case, if what is required is to reduce by 3 along the rows, the answer is not possible because the matrix consists of only 3 rows.

Histogram modification
A histogram is a chart of the gray levels of an image showing how those levels are distributed: the part of the histogram that contains the image information is filled and the rest of the range is empty, depending on the gray values of the image points. Several characteristic histogram shapes can be distinguished:
1. A histogram with a small spread of gray levels: a low-contrast image.
2. A histogram with a large spread of gray levels: a high-contrast image.
3. A histogram clustered at the low end: a dark image.
4. A histogram clustered at the top end: a bright (light) image.

The process of changing the histogram is done in three ways:
1. Histogram stretching
2. Histogram shrinking (compression)
3. Histogram sliding

The first method: histogram stretching. The histogram can be expanded according to the following law (the standard stretching formula):

Stretch(I(r,c)) = [(I(r,c) - I(r,c)min) / (I(r,c)max - I(r,c)min)] x (MAX - MIN) + MIN

where:
1. I(r,c)max is the largest gray level value in the image.
2. I(r,c)min is the smallest gray level value in the image.
3. MAX and MIN are the largest and smallest possible gray level values (255, 0).
Example: you have the following image part; expand this part of the image using the histogram stretching method.

The second method: histogram shrinking. The histogram can be reduced according to the following law (the standard shrinking formula):

Shrink(I(r,c)) = [(Shrinkmax - Shrinkmin) / (I(r,c)max - I(r,c)min)] x (I(r,c) - I(r,c)min) + Shrinkmin

where:
1. I(r,c)max is the largest gray level value in the image.
2. I(r,c)min is the smallest gray level value in the image.
3. Shrinkmax and Shrinkmin are the maximum and minimum gray level values desired in the shrunk image, within the possible range (0, 255).
Example: you have the following image part; shrink this part of the image using the histogram shrinking method.

The third method: histogram sliding. The histogram can be shifted by a certain distance according to the following law:

Slide(I(r,c)) = I(r,c) - OFFSET

where OFFSET is the amount by which the histogram is shifted.
Example: you have the following part of the image that needs to be shifted by a distance of 10 units using the histogram slide method.
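A sketch of histogram stretching and sliding as defined above, assuming the standard stretch formula with output range [MIN, MAX] = [0, 255]; the input values are made up:

```python
import numpy as np

def histogram_stretch(img, new_min=0, new_max=255):
    """Stretch: map [I_min, I_max] linearly onto [new_min, new_max]."""
    i_min, i_max = float(img.min()), float(img.max())
    stretched = (img - i_min) / (i_max - i_min) * (new_max - new_min) + new_min
    return stretched.astype(np.uint8)

def histogram_slide(img, offset):
    """Slide: shift every gray level by a fixed offset (clipped to 0..255)."""
    return np.clip(img.astype(int) - offset, 0, 255).astype(np.uint8)

part = np.array([[100, 110], [120, 130]], dtype=np.uint8)
print(histogram_stretch(part))      # values spread over the full 0..255 range
print(histogram_slide(part, 10))    # every value shifted down by 10 units
```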
Introduction
Computer vision is the automatic analysis of images and videos by computers in order to gain some understanding of the world. Computer vision is inspired by the capabilities of the human vision system and, when initially addressed in the 1960s and 1970s, it was thought to be a relatively straightforward problem to solve. However, the reason we think/thought that vision is easy is that we have our own visual system which makes the task seem intuitive to our conscious minds. In fact, the human visual system is very complex and even the estimates of how much of the brain is involved with visual processing vary from 25% up to more than 50%.

1.1 A Difficult Problem
The first challenge facing anyone studying this subject is to convince them that the problem is difficult. To try to illustrate the difficulty, we first show three different versions of the same image in Figure 1.1. For a computer, an image is just an array of values, such as the array shown in the left-hand image in Figure 1.1. For us, using our complex vision system, we can perceive this as a face image but only if we are shown it as a grey scale image (top right). Computer vision is quite like understanding the array of values shown in Figure 1.1, but is more complicated as the array is really much bigger (e.g. to be equivalent to the human eye a camera would need around 127 million elements), and more complex (i.e. with each point represented by three values in order to encode colour information). To make the task even more convoluted, the images are constantly changing, providing a stream of 50–60 images per second and, of course, there are two streams of data as we have two eyes/cameras.

Figure 1.1 Different versions of an image. An array of numbers (left) which are the values of the grey scales in the low-resolution image of a face (top right). The task of computer vision is most like understanding the array of numbers.

1.2 The Human Vision System
If we could duplicate the human visual system then the problem of developing a computer vision system would be solved. So why can't we? The main difficulty is that we do not understand what the human vision system is doing most of the time. If you consider your eyes, it is probably not clear to you that your colour vision (provided by the 6–7 million cones in the eye) is concentrated in the centre of the visual field of the eye (known as the macula). The rest of your retina is made up of around 120 million rods (cells that are sensitive to visible light of any wavelength/colour). In addition, each eye has a rather large blind spot where the optic nerve attaches to the retina. Somehow, we think we see a continuous image (i.e. no blind spot) with colour everywhere, but even at this lowest level of processing it is unclear as to how this impression occurs within the brain.
The visual cortex (at the back of the brain) has been studied and found to contain cells that perform a type of edge detection (see Chapter 6), but mostly we know what sections of the brain do based on localised brain damage to individuals. For example, a number of people with damage to a particular section of the brain can no longer recognise faces (a condition known as prosopagnosia). Other people have lost the ability to sense moving objects (a condition known as akinetopsia). These conditions inspire us to develop separate modules to recognise faces (e.g. see Section 8.4) and to detect object motion (e.g. see Chapter 9). We can also look at the brain using functional MRI, which allows us to see the concentration of electrical activity in different parts of the brain as subjects perform various activities. Again, this may tell us what large parts of the brain are doing, but it cannot provide us with algorithms to solve the problem of interpreting the massive arrays of numbers that video cameras provide.

1.3 Practical Applications of Computer Vision
Computer vision has many applications in industry, particularly allowing the automatic inspection of manufactured goods at any stage in the production line. For example, it has been used to:
Inspect printed circuit boards to ensure that tracks and components are placed correctly. See Figure 1.2.
Inspect print quality of labels. See Figure 1.3.
Inspect bottles to ensure they are properly filled. See Figure 1.3.

Figure 1.2 PCB inspection of pads (left) and images of some detected flaws in the surface mounting of components (right). Reproduced by permission of James Mahon
Figure 1.3 Checking print quality of best-before dates (left), and monitoring the level to which bottles are filled (right). Reproduced by permission of Omron Electronics LLC

Guide robots when manufacturing complex products such as cars.
On the factory floor, the problem is a little simpler than in the real world as the lighting can be constrained and the possible variations of what we can see are quite limited. Computer vision is now solving problems outside the factory. Computer vision applications outside the factory include:
The automatic reading of license plates as they pass through tollgates on major roads.
Augmenting sports broadcasts by determining distances for penalties, along with a range of other statistics (such as how far each player has travelled during the game).
Biometric security checks in airports using images of faces and images of fingerprints. See Figure 1.4.
Augmenting movies by the insertion of virtual objects into video sequences, so that they appear as though they belong (e.g. the candles in the Great Hall in the Harry Potter movies).

Figure 1.4 Buried landmines in an infrared image (left), reproduced by permission of Zouheir Fawaz; handprint recognition system (right), reproduced by permission of Siemens AG

Assisting drivers by warning them when they are drifting out of lane.
Creating 3D models of a destroyed building from multiple old photographs.
Advanced interfaces for computer games allowing the real-time detection of players or their hand-held controllers.
Classification of plant types and anticipated yields based on multispectral satellite images.
Detecting buried landmines in infrared images. See Figure 1.4.
Some examples of existing computer vision systems in the outside world are shown in Figure 1.4.
1.4 The Future of Computer Vision
The community of vision developers is constantly pushing the boundaries of what we can achieve. While we can produce autonomous vehicles which drive themselves on a highway, we would have difficulty producing a reliable vehicle to work on minor roads, particularly if the road markings were poor. Even in the highway environment, though, we have a legal issue: who is to blame if the vehicle crashes? Clearly, those developing the technology do not think it should be them, and would rather that the driver should still be responsible should anything go wrong. This issue of liability is a difficult one and arises with many vision applications in the real world. Taking another example, if we develop a medical imaging system to diagnose cancer, what will happen when it mistakenly does not diagnose a condition? Even though the system might be more reliable than any individual radiologist, we enter a legal minefield. Therefore, for now, the simplest solution is either to address only non-critical problems or to develop systems which are assistants to, rather than replacements for, the current human experts.

Another problem exists with the deployment of computer vision systems. In some countries the installation and use of video cameras is considered an infringement of our basic right to privacy. This varies hugely from country to country, from company to company, and even from individual to individual. While most people involved with technology see the potential benefits of camera systems, many people are inherently distrustful of video cameras and what the videos could be used for. Among other things, they fear (perhaps justifiably) a Big Brother scenario, where our movements and actions are constantly monitored. Despite this, the number of cameras is growing very rapidly, as there are cameras on virtually every new computer, every new phone, every new games console, and so on.

Moving forwards, we expect to see computer vision addressing progressively harder problems; that is, problems in more complex environments with fewer constraints. We expect computer vision to start to be able to recognise more objects of different types and to begin to extract more reliable and robust descriptions of the world in which they operate. For example, we expect computer vision to become an integral part of general computer interfaces; provide increased levels of security through biometric analysis; provide reliable diagnoses of medical conditions from medical images and medical records; allow vehicles to be driven autonomously; automatically determine the identity of criminals through the forensic analysis of video.

Figure 1.5 The ASIMO humanoid robot, which has two cameras in its 'head' which allow ASIMO to determine how far away things are, recognise familiar faces, etc. Reproduced by permission of Honda Motor Co. Inc

Ultimately, computer vision is aiming to emulate the capabilities of human vision, and to provide these abilities to humanoid (and other) robotic devices, such as ASIMO (see Figure 1.5). This is part of what makes this field exciting, and surprising, as we all have our own (human) vision systems which work remarkably well, yet when we try to automate any computer vision task it proves very difficult to do reliably.

2 Images
Images play a crucial role in computer vision, serving as the visual data captured by devices like cameras. They represent the appearance of scenes, which can be processed to highlight key features before extracting information.
Images often contain noise, which can be reduced using basic image processing methods.

2.1 Cameras
A camera includes a photosensitive image plane that detects light, a housing that blocks unwanted light, and a lens that directs light onto the image plane in a controlled manner, focusing the light rays.

2.1.1 The Simple Pinhole Camera Model
The pinhole camera model is a basic yet realistic representation of a camera, where the lens is considered a simple pinhole through which all light rays pass to reach the image plane. This model simplifies real imaging systems, which often have distortions caused by lenses. Adjustments to address these distortions are discussed in Section 5.6.

Figure 2.1 illustrates the pinhole camera model, demonstrating how the 3D real world (right side) relates to images on the image plane (left side). The pinhole serves as the origin in the XYZ coordinate system. In practice, the image plane needs to be enclosed in a housing to block stray light. In homogeneous coordinates, w acts as a scaling factor for image points. f_i and f_j represent a combination of the camera's focal length and pixel sizes in the I and J directions. (c_i, c_j) are the coordinates where the optical axis, a line perpendicular to the image plane passing through the pinhole, intersects the image plane.
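A sketch of the pinhole projection in homogeneous coordinates described above: a 3D point (X, Y, Z) is mapped to image coordinates using focal/pixel-size factors f_i, f_j and a principal point (c_i, c_j). All numeric values are assumed for illustration:

```python
import numpy as np

# Assumed camera parameters (illustrative values only).
f_i, f_j = 800.0, 800.0   # focal length combined with pixel size, I and J directions
c_i, c_j = 320.0, 240.0   # where the optical axis meets the image plane

# Intrinsic camera matrix for the pinhole model.
K = np.array([[f_i, 0.0, c_i],
              [0.0, f_j, c_j],
              [0.0, 0.0, 1.0]])

# A 3D point in camera coordinates (X, Y, Z), with Z along the optical axis.
point_3d = np.array([0.5, 0.2, 2.0])

# Projection: homogeneous image point (w*i, w*j, w); w acts as the scaling factor.
homogeneous = K @ point_3d
i, j = homogeneous[:2] / homogeneous[2]
print(i, j)   # pixel coordinates of the projected point
```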
2.2 Images
An image is a 2D projection of a 3D scene captured by a sensor, represented as a continuous function of two coordinates (i, j), (column, row), or (x, y). For digital processing, the image needs to be converted into a suitable format. To process an image digitally, it is sampled into a matrix with M rows and N columns and then quantized, assigning each matrix element an integer value. The continuous range is divided into intervals, commonly k = 256.

2.2.1 Sampling
Digital images are formed by sampling a continuous image into discrete elements using a 2D array of photosensitive elements (pixels). Each pixel has a fixed photosensitive area, with non-photosensitive borders between them. There is a small chance that objects could be missed if their light falls only in these border areas. A bigger challenge with sampling is that each pixel represents the average luminance or chrominance over an area, which might include light from multiple objects, especially at object boundaries. The number of samples in an image determines the ability to distinguish objects within it. A sufficient resolution (number of pixels) is crucial for accurately recognizing objects. However, if the resolution is too high, it may include unnecessary details, making processing more difficult and slower.

Figure 2.2 Four different samplings of the same image; top left 256x192, top right 128x96, bottom left 64x48 and bottom right 32x24

2.2.2 Quantization
Each pixel in a digital image f(i, j) represents scene brightness, which is a continuous quantity. However, these brightness values must be represented using discrete digital values. Typically, the number of brightness levels per channel is k = 2^b, where b is the number of bits, commonly set to 8. The essential question is how many bits are truly needed to represent pixels. Using more bits increases memory requirements, while using fewer bits results in information loss. Although 8-bit and 6-bit images appear similar, the latter uses 25% fewer bits. However, 4-bit and 2-bit images show significant issues, even if many objects can still be recognized. The required bit depth depends on the intended use of the image. For automatic machine interpretation, more quantization levels are necessary to avoid false contours and incorrect segmentation, as seen in lower-bit images.

Figure 2.3 Four different quantizations of the same grey-scale image; top left 8 bits, top right 6 bits, bottom left 4 bits and bottom right 2 bits
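A sketch of quantization to b bits per pixel as discussed above: the low-order bits of an 8-bit value are discarded, leaving 2^b brightness levels (illustrative code, not from the book):

```python
import numpy as np

def quantize(img, bits):
    """Keep only the top `bits` bits of an 8-bit image, then rescale for display."""
    shift = 8 - bits
    levels = img >> shift                 # 2**bits distinct brightness levels
    return (levels << shift).astype(np.uint8)

img8 = np.arange(0, 256, dtype=np.uint8).reshape(16, 16)   # a simple 8-bit ramp
for b in (8, 6, 4, 2):
    print(b, "bits ->", len(np.unique(quantize(img8, b))), "levels")
```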