Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O'Reilly Media, 2019)


SECOND EDITION Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow Concepts, Tools, and Techniques to Build Intelligent Systems Aurélien Géron Beijing Boston Farnham Sebastopol Tokyo Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron Copyright © 2019 Aurélien Géron. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Editor: Nicole Tache Cover Designer: Karen Montgomery Interior Designer: David Futato Illustrator: Rebecca Demarest June 2019: Second Edition Revision History for the Early Release 2018-11-05: First Release 2019-01-24: Second Release 2019-03-07: Third Release 2019-03-29: Fourth Release 2019-04-22: Fifth Release See http://oreilly.com/catalog/errata.csp?isbn=9781492032649 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-492-03264-9 [LSI] Table of Contents Preface....................................................................... xi Part I. The Fundamentals of Machine Learning 1. The Machine Learning Landscape............................................. 3 What Is Machine Learning? 4 Why Use Machine Learning? 4 Types of Machine Learning Systems 8 Supervised/Unsupervised Learning 8 Batch and Online Learning 15 Instance-Based Versus Model-Based Learning 18 Main Challenges of Machine Learning 24 Insufficient Quantity of Training Data 24 Nonrepresentative Training Data 26 Poor-Quality Data 27 Irrelevant Features 27 Overfitting the Training Data 28 Underfitting the Training Data 30 Stepping Back 30 Testing and Validating 31 Hyperparameter Tuning and Model Selection 32 Data Mismatch 33 Exercises 34 2. End-to-End Machine Learning Project........................................ 
37 Working with Real Data 38 Look at the Big Picture 39 iii Frame the Problem 39 Select a Performance Measure 42 Check the Assumptions 45 Get the Data 45 Create the Workspace 45 Download the Data 49 Take a Quick Look at the Data Structure 50 Create a Test Set 54 Discover and Visualize the Data to Gain Insights 58 Visualizing Geographical Data 59 Looking for Correlations 62 Experimenting with Attribute Combinations 65 Prepare the Data for Machine Learning Algorithms 66 Data Cleaning 67 Handling Text and Categorical Attributes 69 Custom Transformers 71 Feature Scaling 72 Transformation Pipelines 73 Select and Train a Model 75 Training and Evaluating on the Training Set 75 Better Evaluation Using Cross-Validation 76 Fine-Tune Your Model 79 Grid Search 79 Randomized Search 81 Ensemble Methods 82 Analyze the Best Models and Their Errors 82 Evaluate Your System on the Test Set 83 Launch, Monitor, and Maintain Your System 84 Try It Out! 85 Exercises 85 3. Classification............................................................. 87 MNIST 87 Training a Binary Classifier 90 Performance Measures 90 Measuring Accuracy Using Cross-Validation 91 Confusion Matrix 92 Precision and Recall 94 Precision/Recall Tradeoff 95 The ROC Curve 99 Multiclass Classification 102 Error Analysis 104 iv | Table of Contents Multilabel Classification 108 Multioutput Classification 109 Exercises 110 4. Training Models.......................................................... 113 Linear Regression 114 The Normal Equation 116 Computational Complexity 119 Gradient Descent 119 Batch Gradient Descent 123 Stochastic Gradient Descent 126 Mini-batch Gradient Descent 129 Polynomial Regression 130 Learning Curves 132 Regularized Linear Models 136 Ridge Regression 137 Lasso Regression 139 Elastic Net 142 Early Stopping 142 Logistic Regression 144 Estimating Probabilities 144 Training and Cost Function 145 Decision Boundaries 146 Softmax Regression 149 Exercises 153 5. Support Vector Machines.................................................. 155 Linear SVM Classification 155 Soft Margin Classification 156 Nonlinear SVM Classification 159 Polynomial Kernel 160 Adding Similarity Features 161 Gaussian RBF Kernel 162 Computational Complexity 163 SVM Regression 164 Under the Hood 166 Decision Function and Predictions 166 Training Objective 167 Quadratic Programming 169 The Dual Problem 170 Kernelized SVM 171 Online SVMs 174 Table of Contents | v Exercises 175 6. Decision Trees........................................................... 177 Training and Visualizing a Decision Tree 177 Making Predictions 179 Estimating Class Probabilities 181 The CART Training Algorithm 182 Computational Complexity 183 Gini Impurity or Entropy? 183 Regularization Hyperparameters 184 Regression 185 Instability 188 Exercises 189 7. Ensemble Learning and Random Forests..................................... 191 Voting Classifiers 192 Bagging and Pasting 195 Bagging and Pasting in Scikit-Learn 196 Out-of-Bag Evaluation 197 Random Patches and Random Subspaces 198 Random Forests 199 Extra-Trees 200 Feature Importance 200 Boosting 201 AdaBoost 202 Gradient Boosting 205 Stacking 210 Exercises 213 8. Dimensionality Reduction................................................. 
215 The Curse of Dimensionality 216 Main Approaches for Dimensionality Reduction 218 Projection 218 Manifold Learning 220 PCA 222 Preserving the Variance 222 Principal Components 223 Projecting Down to d Dimensions 224 Using Scikit-Learn 224 Explained Variance Ratio 225 Choosing the Right Number of Dimensions 225 PCA for Compression 226 vi | Table of Contents Randomized PCA 227 Incremental PCA 227 Kernel PCA 228 Selecting a Kernel and Tuning Hyperparameters 229 LLE 232 Other Dimensionality Reduction Techniques 234 Exercises 235 9. Unsupervised Learning Techniques......................................... 237 Clustering 238 K-Means 240 Limits of K-Means 250 Using clustering for image segmentation 251 Using Clustering for Preprocessing 252 Using Clustering for Semi-Supervised Learning 254 DBSCAN 256 Other Clustering Algorithms 259 Gaussian Mixtures 260 Anomaly Detection using Gaussian Mixtures 266 Selecting the Number of Clusters 267 Bayesian Gaussian Mixture Models 270 Other Anomaly Detection and Novelty Detection Algorithms 274 Part II. Neural Networks and Deep Learning 10. Introduction to Artificial Neural Networks with Keras.......................... 277 From Biological to Artificial Neurons 278 Biological Neurons 279 Logical Computations with Neurons 281 The Perceptron 281 Multi-Layer Perceptron and Backpropagation 286 Regression MLPs 289 Classification MLPs 290 Implementing MLPs with Keras 292 Installing TensorFlow 2 293 Building an Image Classifier Using the Sequential API 294 Building a Regression MLP Using the Sequential API 303 Building Complex Models Using the Functional API 304 Building Dynamic Models Using the Subclassing API 309 Saving and Restoring a Model 311 Using Callbacks 311 Table of Contents | vii Visualization Using TensorBoard 313 Fine-Tuning Neural Network Hyperparameters 315 Number of Hidden Layers 319 Number of Neurons per Hidden Layer 320 Learning Rate, Batch Size and Other Hyperparameters 320 Exercises 322 11. Training Deep Neural Networks............................................ 325 Vanishing/Exploding Gradients Problems 326 Glorot and He Initialization 327 Nonsaturating Activation Functions 329 Batch Normalization 333 Gradient Clipping 338 Reusing Pretrained Layers 339 Transfer Learning With Keras 341 Unsupervised Pretraining 343 Pretraining on an Auxiliary Task 344 Faster Optimizers 344 Momentum Optimization 345 Nesterov Accelerated Gradient 346 AdaGrad 347 RMSProp 349 Adam and Nadam Optimization 349 Learning Rate Scheduling 352 Avoiding Overfitting Through Regularization 356 ℓ1 and ℓ2 Regularization 356 Dropout 357 Monte-Carlo (MC) Dropout 360 Max-Norm Regularization 362 Summary and Practical Guidelines 363 Exercises 364 12. Custom Models and Training with TensorFlow................................ 367 A Quick Tour of TensorFlow 368 Using TensorFlow like NumPy 371 Tensors and Operations 371 Tensors and NumPy 373 Type Conversions 374 Variables 374 Other Data Structures 375 Customizing Models and Training Algorithms 376 Custom Loss Functions 376 viii | Table of Contents Saving and Loading Models That Contain Custom Components 377 Custom Activation Functions, Initializers, Regularizers, and Constraints 379 Custom Metrics 380 Custom Layers 383 Custom Models 386 Losses and Metrics Based on Model Internals 388 Computing Gradients Using Autodiff 389 Custom Training Loops 393 TensorFlow Functions and Graphs 396 Autograph and Tracing 398 TF Function Rules 400 13. Loading and Preprocessing Data with TensorFlow............................. 
403 The Data API 404 Chaining Transformations 405 Shuffling the Data 406 Preprocessing the Data 409 Putting Everything Together 410 Prefetching 411 Using the Dataset With tf.keras 413 The TFRecord Format 414 Compressed TFRecord Files 415 A Brief Introduction to Protocol Buffers 415 TensorFlow Protobufs 416 Loading and Parsing Examples 418 Handling Lists of Lists Using the SequenceExample Protobuf 419 The Features API 420 Categorical Features 421 Crossed Categorical Features 421 Encoding Categorical Features Using One-Hot Vectors 422 Encoding Categorical Features Using Embeddings 423 Using Feature Columns for Parsing 426 Using Feature Columns in Your Models 426 TF Transform 428 The TensorFlow Datasets (TFDS) Project 429 14. Deep Computer Vision Using Convolutional Neural Networks................... 431 The Architecture of the Visual Cortex 432 Convolutional Layer 434 Filters 436 Stacking Multiple Feature Maps 437 TensorFlow Implementation 439 Table of Contents | ix Memory Requirements 441 Pooling Layer 442 TensorFlow Implementation 444 CNN Architectures 446 LeNet-5 449 AlexNet 450 GoogLeNet 452 VGGNet 456 ResNet 457 Xception 459 SENet 461 Implementing a ResNet-34 CNN Using Keras 464 Using Pretrained Models From Keras 465 Pretrained Models for Transfer Learning 467 Classification and Localization 469 Object Detection 471 Fully Convolutional Networks (FCNs) 473 You Only Look Once (YOLO) 475 Semantic Segmentation 478 Exercises 482 x | Table of Contents Preface The Machine Learning Tsunami In 2006, Geoffrey Hinton et al. published a paper1 showing how to train a deep neural network capable of recognizing handwritten digits with state-of-the-art precision (>98%). They branded this technique “Deep Learning.” Training a deep neural net was widely considered impossible at the time,2 and most researchers had abandoned the idea since the 1990s. This paper revived the interest of the scientific community and before long many new papers demonstrated that Deep Learning was not only possible, but capable of mind-blowing achievements that no other Machine Learning (ML) technique could hope to match (with the help of tremendous computing power and great amounts of data). This enthusiasm soon extended to many other areas of Machine Learning. Fast-forward 10 years and Machine Learning has conquered the industry: it is now at the heart of much of the magic in today’s high-tech products, ranking your web search results, powering your smartphone’s speech recognition, recommending vid‐ eos, and beating the world champion at the game of Go. Before you know it, it will be driving your car. Machine Learning in Your Projects So naturally you are excited about Machine Learning and you would love to join the party! Perhaps you would like to give your homemade robot a brain of its own? Make it rec‐ ognize faces? Or learn to walk around? 1 Available on Hinton’s home page at http://www.cs.toronto.edu/~hinton/. 2 Despite the fact that Yann Lecun’s deep convolutional neural networks had worked well for image recognition since the 1990s, although they were not as general purpose. 
xi Or maybe your company has tons of data (user logs, financial data, production data, machine sensor data, hotline stats, HR reports, etc.), and more than likely you could unearth some hidden gems if you just knew where to look; for example: Segment customers and find the best marketing strategy for each group Recommend products for each client based on what similar clients bought Detect which transactions are likely to be fraudulent Forecast next year’s revenue And more Whatever the reason, you have decided to learn Machine Learning and implement it in your projects. Great idea! Objective and Approach This book assumes that you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually imple‐ ment programs capable of learning from data. We will cover a large number of techniques, from the simplest and most commonly used (such as linear regression) to some of the Deep Learning techniques that regu‐ larly win competitions. Rather than implementing our own toy versions of each algorithm, we will be using actual production-ready Python frameworks: Scikit-Learn is very easy to use, yet it implements many Machine Learning algo‐ rithms efficiently, so it makes for a great entry point to learn Machine Learning. TensorFlow is a more complex library for distributed numerical computation. It makes it possible to train and run very large neural networks efficiently by dis‐ tributing the computations across potentially hundreds of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open sourced in November 2015. Keras is a high level Deep Learning API that makes it very simple to train and run neural networks. It can run on top of either TensorFlow, Theano or Micro‐ soft Cognitive Toolkit (formerly known as CNTK). TensorFlow comes with its own implementation of this API, called tf.keras, which provides support for some advanced TensorFlow features (e.g., to efficiently load data). The book favors a hands-on approach, growing an intuitive understanding of Machine Learning through concrete working examples and just a little bit of theory. While you can read this book without picking up your laptop, we highly recommend xii | Preface you experiment with the code examples available online as Jupyter notebooks at https://github.com/ageron/handson-ml2. Prerequisites This book assumes that you have some Python programming experience and that you are familiar with Python’s main scientific libraries, in particular NumPy, Pandas, and Matplotlib. Also, if you care about what’s under the hood you should have a reasonable under‐ standing of college-level math as well (calculus, linear algebra, probabilities, and sta‐ tistics). If you don’t know Python yet, http://learnpython.org/ is a great place to start. The offi‐ cial tutorial on python.org is also quite good. If you have never used Jupyter, Chapter 2 will guide you through installation and the basics: it is a great tool to have in your toolbox. If you are not familiar with Python’s scientific libraries, the provided Jupyter note‐ books include a few tutorials. There is also a quick math tutorial for linear algebra. Roadmap This book is organized in two parts. Part I, The Fundamentals of Machine Learning, covers the following topics: What is Machine Learning? What problems does it try to solve? What are the main categories and fundamental concepts of Machine Learning systems? 
The main steps in a typical Machine Learning project. Learning by fitting a model to data. Optimizing a cost function. Handling, cleaning, and preparing data. Selecting and engineering features. Selecting a model and tuning hyperparameters using cross-validation. The main challenges of Machine Learning, in particular underfitting and overfit‐ ting (the bias/variance tradeoff). Reducing the dimensionality of the training data to fight the curse of dimension‐ ality. Other unsupervised learning techniques, including clustering, density estimation and anomaly detection. Preface | xiii The most common learning algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods. xiv | Preface Part II, Neural Networks and Deep Learning, covers the following topics: What are neural nets? What are they good for? Building and training neural nets using TensorFlow and Keras. The most important neural net architectures: feedforward neural nets, convolu‐ tional nets, recurrent nets, long short-term memory (LSTM) nets, autoencoders and generative adversarial networks (GANs). Techniques for training deep neural nets. Scaling neural networks for large datasets. Learning strategies with Reinforcement Learning. Handling uncertainty with Bayesian Deep Learning. The first part is based mostly on Scikit-Learn while the second part uses TensorFlow and Keras. Don’t jump into deep waters too hastily: while Deep Learning is no doubt one of the most exciting areas in Machine Learning, you should master the fundamentals first. Moreover, most problems can be solved quite well using simpler techniques such as Random Forests and Ensemble methods (discussed in Part I). Deep Learn‐ ing is best suited for complex problems such as image recognition, speech recognition, or natural language processing, provided you have enough data, computing power, and patience. Other Resources Many resources are available to learn about Machine Learning. Andrew Ng’s ML course on Coursera and Geoffrey Hinton’s course on neural networks and Deep Learning are amazing, although they both require a significant time investment (think months). There are also many interesting websites about Machine Learning, including of course Scikit-Learn’s exceptional User Guide. You may also enjoy Dataquest, which provides very nice interactive tutorials, and ML blogs such as those listed on Quora. Finally, the Deep Learning website has a good list of resources to learn more. Of course there are also many other introductory books about Machine Learning, in particular: Joel Grus, Data Science from Scratch (O’Reilly). This book presents the funda‐ mentals of Machine Learning, and implements some of the main algorithms in pure Python (from scratch, as the name suggests). Preface | xv Stephen Marsland, Machine Learning: An Algorithmic Perspective (Chapman and Hall). This book is a great introduction to Machine Learning, covering a wide range of topics in depth, with code examples in Python (also from scratch, but using NumPy). Sebastian Raschka, Python Machine Learning (Packt Publishing). Also a great introduction to Machine Learning, this book leverages Python open source libra‐ ries (Pylearn 2 and Theano). François Chollet, Deep Learning with Python (Manning). A very practical book that covers a large range of topics in a clear and concise way, as you might expect from the author of the excellent Keras library. It favors code examples over math‐ ematical theory. 
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, Learning from Data (AMLBook). A rather theoretical approach to ML, this book provides deep insights, in particular on the bias/variance tradeoff (see Chapter 4). Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition (Pearson). This is a great (and huge) book covering an incredible amount of topics, including Machine Learning. It helps put ML into perspective. Finally, a great way to learn is to join ML competition websites such as Kaggle.com this will allow you to practice your skills on real-world problems, with help and insights from some of the best ML professionals out there. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. xvi | Preface This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/ageron/handson-ml2. It is mostly composed of Jupyter notebooks. Some of the code examples in the book leave out some repetitive sections, or details that are obvious or unrelated to Machine Learning. This keeps the focus on the important parts of the code, and it saves space to cover more topics. However, if you want the full code examples, they are all available in the Jupyter notebooks. Note that when the code examples display some outputs, then these code examples are shown with Python prompts (>>> and...), as in a Python shell, to clearly distin‐ guish the code from the outputs. For example, this code defines the square() func‐ tion then it computes and displays the square of 3: >>> def square(x):... return x ** 2... >>> result = square(3) >>> result 9 When code does not display anything, prompts are not used. However, the result may sometimes be shown as a comment like this: def square(x): return x ** 2 result = square(3) # result is 9 Preface | xvii Using Code Examples This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a signifi‐ cant amount of example code from this book into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron (O’Reilly). 
Copyright 2019 Aurélien Géron, 978-1-492-03264-9.” If you feel your use of code examples falls out‐ side fair use or the permission given above, feel free to contact us at permis‐ [email protected]. O’Reilly Safari Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals. Members have access to thousands of books, training videos, Learning Paths, interac‐ tive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Profes‐ sional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others. For more information, please visit http://oreilly.com/safari. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) xviii | Preface 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hands-on-machine-learning- with-scikit-learn-and-tensorflow or https://homl.info/oreilly. To comment or ask technical questions about this book, send email to bookques‐ [email protected]. For more information about our books, courses, conferences, and news, see our web‐ site at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Changes in the Second Edition This second edition has five main objectives: 1. Cover additional topics: additional unsupervised learning techniques (including clustering, anomaly detection, density estimation and mixture models), addi‐ tional techniques for training deep nets (including self-normalized networks), additional computer vision techniques (including the Xception, SENet, object detection with YOLO, and semantic segmentation using R-CNN), handling sequences using CNNs (including WaveNet), natural language processing using RNNs, CNNs and Transformers, generative adversarial networks, deploying Ten‐ sorFlow models, and more. 2. Update the book to mention some of the latest results from Deep Learning research. 3. Migrate all TensorFlow chapters to TensorFlow 2, and use TensorFlow’s imple‐ mentation of the Keras API (called tf.keras) whenever possible, to simplify the code examples. 4. Update the code examples to use the latest version of Scikit-Learn, NumPy, Pan‐ das, Matplotlib and other libraries. 5. Clarify some sections and fix some errors, thanks to plenty of great feedback from readers. Some chapters were added, others were rewritten and a few were reordered. Table P-1 shows the mapping between the 1st edition chapters and the 2nd edition chapters: Preface | xix Table P-1. Chapter mapping between 1st and 2nd edition 1st Ed. chapter 2nd Ed. Chapter % Changes 2nd Ed. 
Title
1 1

        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                keras.layers.BatchNormalization()]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

As you can see, this code matches Figure 14-18 pretty closely. In the constructor, we create all the layers we will need: the main layers are the ones on the right side of the diagram, and the skip layers are the ones on the left (only needed if the stride is greater than 1). Then in the call() method, we simply make the inputs go through the main layers, and the skip layers (if any), then we add both outputs and we apply the activation function.

Next, we can build the ResNet-34 simply using a Sequential model, since it is really just a long sequence of layers (we can treat each residual unit as a single layer now that we have the ResidualUnit class):

model = keras.models.Sequential()
model.add(DefaultConv2D(64, kernel_size=7, strides=2,
                        input_shape=[224, 224, 3]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="SAME"))
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))

The only slightly tricky part in this code is the loop that adds the ResidualUnit layers to the model: as explained earlier, the first 3 RUs have 64 filters, then the next 4 RUs have 128 filters, and so on. We then set the strides to 1 when the number of filters is the same as in the previous RU, or else we set it to 2. Then we add the ResidualUnit, and finally we update prev_filters. It is quite amazing that in less than 40 lines of code, we can build the model that won the ILSVRC 2015 challenge! It demonstrates both the elegance of the ResNet model and the expressiveness of the Keras API. Implementing the other CNN architectures is not much harder. However, Keras comes with several of these architectures built in, so why not use them instead?

Using Pretrained Models From Keras

In general, you won't have to implement standard models like GoogLeNet or ResNet manually, since pretrained networks are readily available with a single line of code, in the keras.applications package. For example:

model = keras.applications.resnet50.ResNet50(weights="imagenet")

That's all! This will create a ResNet-50 model and download weights pretrained on the ImageNet dataset. To use it, you first need to ensure that the images have the right size. A ResNet-50 model expects 224 × 224 images (other models may expect other sizes, such as 299 × 299), so let's use TensorFlow's tf.image.resize() function to resize the images we loaded earlier:

images_resized = tf.image.resize(images, [224, 224])

The tf.image.resize() function will not preserve the aspect ratio. If this is a problem, you can try cropping the images to the appropriate aspect ratio before resizing. Both operations can be done in one shot with tf.image.crop_and_resize().
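For instance, here is a hypothetical illustration (not the book's code) of center-cropping each image to a square and resizing it to 224 × 224 in a single call; the box coordinates below assume, purely for the example, a 4-D batch of identically sized pictures that are 427 pixels tall and 640 pixels wide:

margin = (1 - 427 / 640) / 2                              # width fraction to trim on each side
boxes = [[0.0, margin, 1.0, 1.0 - margin]] * len(images)  # normalized [y1, x1, y2, x2] per image
box_indices = list(range(len(images)))                    # which image each box applies to
images_resized = tf.image.crop_and_resize(images, boxes, box_indices, [224, 224])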
The pretrained models assume that the images are preprocessed in a specific way. In some cases they may expect the inputs to be scaled from 0 to 1, or -1 to 1, and so on. Each model provides a preprocess_input() function that you can use to preprocess your images. These functions assume that the pixel values range from 0 to 255, so we must multiply them by 255 (since earlier we scaled them to the 0–1 range):

inputs = keras.applications.resnet50.preprocess_input(images_resized * 255)

Now we can use the pretrained model to make predictions:

Y_proba = model.predict(inputs)

As usual, the output Y_proba is a matrix with one row per image and one column per class (in this case, there are 1,000 classes). If you want to display the top K predictions, including the class name and the estimated probability of each predicted class, you can use the decode_predictions() function. For each image, it returns an array containing the top K predictions, where each prediction is represented as an array containing the class identifier [21], its name and the corresponding confidence score:

top_K = keras.applications.resnet50.decode_predictions(Y_proba, top=3)
for image_index in range(len(images)):
    print("Image #{}".format(image_index))
    for class_id, name, y_proba in top_K[image_index]:
        print("  {} - {:12s} {:.2f}%".format(class_id, name, y_proba * 100))
    print()

The output looks like this:

Image #0
  n03877845 - palace       42.87%
  n02825657 - bell_cote    40.57%
  n03781244 - monastery    14.56%

Image #1
  n04522168 - vase         46.83%
  n07930864 - cup           7.78%
  n11939491 - daisy         4.87%

[21] In the ImageNet dataset, each image is associated with a word in the WordNet dataset: the class ID is just a WordNet ID.

The correct classes (monastery and daisy) appear in the top 3 results for both images. That's pretty good considering that the model had to choose among 1,000 classes. As you can see, it is very easy to create a pretty good image classifier using a pretrained model. Other vision models are available in keras.applications, including several ResNet variants, GoogLeNet variants like InceptionV3 and Xception, VGGNet variants, MobileNet and MobileNetV2 (lightweight models for use in mobile applications), and more. But what if you want to use an image classifier for classes of images that are not part of ImageNet? In that case, you may still benefit from the pretrained models to perform transfer learning.

Pretrained Models for Transfer Learning

If you want to build an image classifier, but you do not have enough training data, then it is often a good idea to reuse the lower layers of a pretrained model, as we discussed in Chapter 11. For example, let's train a model to classify pictures of flowers, reusing a pretrained Xception model. First, let's load the dataset using TensorFlow Datasets (see Chapter 13):

import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
dataset_size = info.splits["train"].num_examples  # 3670
class_names = info.features["label"].names        # ["dandelion", "daisy",...]
n_classes = info.features["label"].num_classes    # 5

Note that you can get information about the dataset by setting with_info=True. Here, we get the dataset size and the names of the classes. Unfortunately, there is only a "train" dataset, no test set or validation set, so we need to split the training set. The TF Datasets project provides an API for this. For example, let's take the first 10% of the dataset for testing, the next 15% for validation, and the remaining 75% for training:

test_split, valid_split, train_split = tfds.Split.TRAIN.subsplit([10, 15, 75])

test_set = tfds.load("tf_flowers", split=test_split, as_supervised=True)
valid_set = tfds.load("tf_flowers", split=valid_split, as_supervised=True)
train_set = tfds.load("tf_flowers", split=train_split, as_supervised=True)

Next we must preprocess the images. The CNN expects 224 × 224 images, so we need to resize them. We also need to run the image through Xception's preprocess_input() function:

def preprocess(image, label):
    resized_image = tf.image.resize(image, [224, 224])
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image, label

Let's apply this preprocessing function to all 3 datasets, and let's also shuffle & repeat the training set, and add batching & prefetching to all datasets:

batch_size = 32
train_set = train_set.shuffle(1000).repeat()
train_set = train_set.map(preprocess).batch(batch_size).prefetch(1)
valid_set = valid_set.map(preprocess).batch(batch_size).prefetch(1)
test_set = test_set.map(preprocess).batch(batch_size).prefetch(1)

If you want to perform some data augmentation, you can just change the preprocessing function for the training set, adding some random transformations to the training images. For example, use tf.image.random_crop() to randomly crop the images, use tf.image.random_flip_left_right() to randomly flip the images horizontally, and so on (see the notebook for an example, and the sketch below).
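Here is a rough sketch of what such an augmented preprocessing function could look like (the 250 × 250 intermediate size and the particular transformations are illustrative choices, not the notebook's code); you would map it on the training set in place of preprocess:

def train_preprocess(image, label):
    resized_image = tf.image.resize(image, [250, 250])                   # slightly larger than needed
    cropped_image = tf.image.random_crop(resized_image, [224, 224, 3])   # random 224 × 224 crop
    flipped_image = tf.image.random_flip_left_right(cropped_image)       # random horizontal flip
    final_image = keras.applications.xception.preprocess_input(flipped_image)
    return final_image, label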
Next let's load an Xception model, pretrained on ImageNet. We exclude the top of the network (by setting include_top=False): this excludes the global average pooling layer and the dense output layer. We then add our own global average pooling layer, based on the output of the base model, followed by a dense output layer with 1 unit per class, using the softmax activation function. Finally, we create the Keras Model:

base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(n_classes, activation="softmax")(avg)
model = keras.models.Model(inputs=base_model.input, outputs=output)

As explained in Chapter 11, it's usually a good idea to freeze the weights of the pretrained layers, at least at the beginning of training:

for layer in base_model.layers:
    layer.trainable = False

Since our model uses the base model's layers directly, rather than the base_model object itself, setting base_model.trainable=False would have no effect. Finally, we can compile the model and start training:

optimizer = keras.optimizers.SGD(lr=0.2, momentum=0.9, decay=0.01)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_set,
                    steps_per_epoch=int(0.75 * dataset_size / batch_size),
                    validation_data=valid_set,
                    validation_steps=int(0.15 * dataset_size / batch_size),
                    epochs=5)

This will be very slow, unless you have a GPU. If you do not, then you should run this chapter's notebook in Colab, using a GPU runtime (it's free!). See the instructions at https://github.com/ageron/handson-ml2.

After training the model for a few epochs, its validation accuracy should reach about 75-80%, and stop making much progress.
This means that the top layers are now pretty well trained, so we are ready to unfreeze all layers (or you could try unfreezing just the top ones), and continue training (don't forget to compile the model when you freeze or unfreeze layers). This time we use a much lower learning rate to avoid damaging the pretrained weights:

for layer in base_model.layers:
    layer.trainable = True

optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=0.001)
model.compile(...)
history = model.fit(...)

It will take a while, but this model should reach around 95% accuracy on the test set. With that, you can start training amazing image classifiers! But there's more to computer vision than just classification. For example, what if you also want to know where the flower is in the picture? Let's look at this now.

Classification and Localization

Localizing an object in a picture can be expressed as a regression task, as discussed in Chapter 10: to predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width. This means we have 4 numbers to predict. It does not require much change to the model; we just need to add a second dense output layer with 4 units (typically on top of the global average pooling layer), and it can be trained using the MSE loss:

base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = keras.layers.Dense(4)(avg)
model = keras.models.Model(inputs=base_model.input,
                           outputs=[class_output, loc_output])
model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[0.8, 0.2],  # depends on what you care most about
              optimizer=optimizer, metrics=["accuracy"])
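Assuming you already had one array of class labels and one array of bounding-box targets for your training images, fitting this two-output model would just mean passing one target per output. The names below are hypothetical, purely for illustration:

# train_images, train_class_labels and train_boxes are hypothetical arrays here
history = model.fit(train_images, [train_class_labels, train_boxes], epochs=10)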
The bounding boxes should be normalized so that the horizontal and vertical coordinates, as well as the height and width all range from 0 to 1. Also, it is common to predict the square root of the height and width rather than the height and width directly: this way, a 10 pixel error for a large bounding box will not be penalized as much as a 10 pixel error for a small bounding box. The MSE often works fairly well as a cost function to train the model, but it is not a great metric to evaluate how well the model can predict bounding boxes. The most common metric for this is the Intersection over Union (IoU): it is the area of overlap between the predicted bounding box and the target bounding box, divided by the 22 “Crowdsourcing in Computer Vision,” A. Kovashka et al. (2016). 470 | Chapter 14: Deep Computer Vision Using Convolutional Neural Networks area of their union (see Figure 14-23). In tf.keras, it is implemented by the tf.keras.metrics.MeanIoU class. Figure 14-23. Intersection over Union (IoU) Metric for Bounding Boxes Classifying and localizing a single object is nice, but what if the images contain multi‐ ple objects (as is often the case in the flowers dataset)? Object Detection The task of classifying and localizing multiple objects in an image is called object detection. Until a few years ago, a common approach was to take a CNN that was trained to classify and locate a single object, then slide it across the image, as shown in Figure 14-24. In this example, the image was chopped into a 6 × 8 grid, and we show a CNN (the thick black rectangle) sliding across all 3 × 3 regions. When the CNN was looking at the top left of the image, it detected part of the left-most rose, and then it detected that same rose again when it was first shifted one step to the right. At the next step, it started detecting part of the top-most rose, and then it detec‐ ted it again once it was shifted one more step to the right. You would then continue to slide the CNN through the whole image, looking at all 3 × 3 regions. Moreover, since objects can have varying sizes, you would also slide the CNN across regions of differ‐ ent sizes. For example, once you are done with the 3 × 3 regions, you might want to slide the CNN across all 4 × 4 regions as well. Object Detection | 471 Figure 14-24. Detecting Multiple Objects by Sliding a CNN Across the Image This technique is fairly straightforward, but as you can see it will detect the same object multiple times, at slightly different positions. Some post-processing will then be needed to get rid of all the unnecessary bounding boxes. A common approach for this is called non-max suppression: First, you need to add an extra objectness output to your CNN, to estimate the probability that a flower is indeed present in the image (alternatively, you could add a “no-flower” class, but this usually does not work as well). It must use the sigmoid activation function and you can train it using the "binary_crossen tropy" loss. Then just get rid of all the bounding boxes for which the objectness score is below some threshold: this will drop all the bounding boxes that don’t actually contain a flower. Second, find the bounding box with the highest objectness score, and get rid of all the other bounding boxes that overlap a lot with it (e.g., with an IoU greater than 60%). For example, in Figure 14-24, the bounding box with the max object‐ ness score is the thick bounding box over the top-most rose (the objectness score is represented by the thickness of the bounding boxes). 
This simple approach to object detection works pretty well, but it requires running the CNN many times, so it is quite slow. Fortunately, there is a much faster way to slide a CNN across an image: using a Fully Convolutional Network.

Fully Convolutional Networks (FCNs)

The idea of FCNs was first introduced in a 2015 paper [23] by Jonathan Long et al., for semantic segmentation (the task of classifying every pixel in an image according to the class of the object it belongs to). They pointed out that you could replace the dense layers at the top of a CNN with convolutional layers. To understand this, let's look at an example: suppose a dense layer with 200 neurons sits on top of a convolutional layer that outputs 100 feature maps, each of size 7 × 7 (this is the feature map size, not the kernel size). Each neuron will compute a weighted sum of all 100 × 7 × 7 activations from the convolutional layer (plus a bias term). Now let's see what happens if we replace the dense layer with a convolutional layer using 200 filters, each 7 × 7, and with VALID padding. This layer will output 200 feature maps, each 1 × 1 (since the kernel is exactly the size of the input feature maps and we are using VALID padding). In other words, it will output 200 numbers, just like the dense layer did, and if you look closely at the computations performed by a convolutional layer, you will notice that these numbers will be precisely the same as the dense layer produced. The only difference is that the dense layer's output was a tensor of shape [batch size, 200] while the convolutional layer will output a tensor of shape [batch size, 1, 1, 200].

[23] "Fully Convolutional Networks for Semantic Segmentation," J. Long, E. Shelhamer, T. Darrell (2015).

To convert a dense layer to a convolutional layer, the number of filters in the convolutional layer must be equal to the number of units in the dense layer, the filter size must be equal to the size of the input feature maps, and you must use VALID padding. The stride may be set to 1 or more, as we will see shortly.

Why is this important? Well, while a dense layer expects a specific input size (since it has one weight per input feature), a convolutional layer will happily process images of any size [24] (however, it does expect its inputs to have a specific number of channels, since each kernel contains a different set of weights for each input channel). Since an FCN contains only convolutional layers (and pooling layers, which have the same property), it can be trained and executed on images of any size!

[24] There is one small exception: a convolutional layer using VALID padding will complain if the input size is smaller than the kernel size.

For example, suppose we already trained a CNN for flower classification and localization. It was trained on 224 × 224 images and it outputs 10 numbers: outputs 0 to 4 are sent through the softmax activation function, and this gives the class probabilities (one per class); output 5 is sent through the logistic activation function, and this gives the objectness score; outputs 6 to 9 do not use any activation function, and they represent the bounding box's center coordinates, and its height and width. We can now convert its dense layers to convolutional layers. In fact, we don't even need to retrain it; we can just copy the weights from the dense layers to the convolutional layers! Alternatively, we could have converted the CNN into an FCN before training.
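To make the dense-to-convolutional equivalence concrete, here is a minimal illustration (an assumed example, not the book's code) using the 7 × 7 × 100 feature maps and the 200-unit dense layer from the discussion above:

# A dense layer on top of flattened 7 × 7 × 100 feature maps...
dense_version = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[7, 7, 100]),
    keras.layers.Dense(200)])
# ...and the equivalent convolutional layer: 200 filters, 7 × 7 kernel, VALID padding.
conv_version = keras.models.Sequential([
    keras.layers.Conv2D(200, kernel_size=7, padding="valid",
                        input_shape=[7, 7, 100])])
# Copy the weights: the Dense kernel of shape [7 * 7 * 100, 200] reshapes directly
# into a Conv2D kernel of shape [7, 7, 100, 200].
dense_kernel, dense_bias = dense_version.layers[1].get_weights()
conv_version.layers[0].set_weights(
    [dense_kernel.reshape(7, 7, 100, 200), dense_bias])
# Both models now compute the same 200 numbers for any 7 × 7 × 100 input; only the
# output shape differs: [batch, 200] versus [batch, 1, 1, 200].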
Now suppose the last convolutional layer before the output layer (also called the bottleneck layer) outputs 7 × 7 feature maps when the network is fed a 224 × 224 image (see the left side of Figure 14-25). If we feed the FCN a 448 × 448 image (see the right side of Figure 14-25), the bottleneck layer will now output 14 × 14 feature maps [25]. Since the dense output layer was replaced by a convolutional layer using 10 filters of size 7 × 7, VALID padding and stride 1, the output will be composed of 10 feature maps, each of size 8 × 8 (since 14 - 7 + 1 = 8). In other words, the FCN will process the whole image only once and it will output an 8 × 8 grid where each cell contains 10 numbers (5 class probabilities, 1 objectness score and 4 bounding box coordinates). It's exactly like taking the original CNN and sliding it across the image using 8 steps per row and 8 steps per column: to visualize this, imagine chopping the original image into a 14 × 14 grid, then sliding a 7 × 7 window across this grid: there will be 8 × 8 = 64 possible locations for the window, hence 8 × 8 predictions. However, the FCN approach is much more efficient, since the network only looks at the image once. In fact, You Only Look Once (YOLO) is the name of a very popular object detection architecture!

[25] This assumes we used only SAME padding in the network: indeed, VALID padding would reduce the size of the feature maps. Moreover, 448 can be neatly divided by 2 several times until we reach 7, without any rounding error. If any layer uses a different stride than 1 or 2, then there may be some rounding error, so again the feature maps may end up being smaller.

Figure 14-25. A Fully Convolutional Network Processing a Small Image (left) and a Large One (right)

You Only Look Once (YOLO)

YOLO is an extremely fast and accurate object detection architecture proposed by Joseph Redmon et al. in a 2015 paper [26], and subsequently improved in 2016 [27] (YOLOv2) and in 2018 [28] (YOLOv3). It is so fast that it can run in real time on a video (check out this nice demo). YOLOv3's architecture is quite similar to the one we just discussed, but with a few important differences:

[26] "You Only Look Once: Unified, Real-Time Object Detection," J. Redmon, S. Divvala, R. Girshick, A. Farhadi (2015).
[27] "YOLO9000: Better, Faster, Stronger," J. Redmon, A. Farhadi (2016).
[28] "YOLOv3: An Incremental Improvement," J. Redmon, A. Farhadi (2018).

First, it outputs 5 bounding boxes for each grid cell (instead of just 1), and each bounding box comes with an objectness score. It also outputs 20 class probabilities per grid cell, as it was trained on the PASCAL VOC dataset, which contains 20 classes. That's a total of 45 numbers per grid cell (5 * 4 bounding box coordinates, plus 5 objectness scores, plus 20 class probabilities).

Second, instead of predicting the absolute coordinates of the bounding box centers, YOLOv3 predicts an offset relative to the coordinates of the grid cell, where (0, 0) means the top left of that cell, and (1, 1) means the bottom right. For each grid cell, YOLOv3 is trained to predict only bounding boxes whose center lies in that cell (but the bounding box itself generally extends well beyond the grid cell).
YOLOv3 applies the logistic activation function to the bounding box coordinates to ensure they remain in the 0 to 1 range.

Third, before training the neural net, YOLOv3 finds 5 representative bounding box dimensions, called anchor boxes (or bounding box priors): it does this by applying the K-Means algorithm (see ???) to the height and width of the training set bounding boxes. For example, if the training images contain many pedestrians, then one of the anchor boxes will likely have the dimensions of a typical pedestrian. Then when the neural net predicts 5 bounding boxes per grid cell, it actually predicts how much to rescale each of the anchor boxes. For example, suppose one anchor box is 100 pixels tall and 50 pixels wide, and the network predicts, say, a vertical rescaling factor of 1.5 and a horizontal rescaling factor of 0.9 (for one of the grid cells); this will result in a predicted bounding box of size 150 × 45 pixels. To be more precise, for each grid cell and each anchor box, the network predicts the log of the vertical and horizontal rescaling factors. Having these priors makes the network more likely to predict bounding boxes of the appropriate dimensions, and it also speeds up training since it will more quickly learn what reasonable bounding boxes look like.

Fourth, the network is trained using images of different scales: every few batches during training, the network randomly chooses a new image dimension (from 330 × 330 to 608 × 608 pixels). This allows the network to learn to detect objects at different scales. Moreover, it makes it possible to use YOLOv3 at different scales: the smaller scale will be less accurate but faster than the larger scale, so you can choose the right tradeoff for your use case.

There are a few more innovations you might be interested in, such as the use of skip connections to recover some of the spatial resolution that is lost in the CNN (we will discuss this shortly when we look at semantic segmentation). Moreover, in the 2016 paper, the authors introduce the YOLO9000 model that uses hierarchical classification: the model predicts a probability for each node in a visual hierarchy called WordTree. This makes it possible for the network to predict with high confidence that an image represents, say, a dog, even though it is unsure what specific type of dog it is. So I encourage you to go ahead and read all three papers: they are quite pleasant to read, and they are an excellent example of how Deep Learning systems can be incrementally improved.

Mean Average Precision (mAP)

A very common metric used in object detection tasks is the mean Average Precision (mAP). "Mean Average" sounds a bit redundant, doesn't it? To understand this metric, let's go back to two classification metrics we discussed in Chapter 3: precision and recall. Remember the tradeoff: the higher the recall, the lower the precision. You can visualize this in a Precision/Recall curve (see Figure 3-5). To summarize this curve into a single number, we could compute its Area Under the Curve (AUC). But note that the Precision/Recall curve may contain a few sections where precision actually goes up when recall increases, especially at low recall values (you can see this at the top left of Figure 3-5). This is one of the motivations for the mAP metric.

Suppose the classifier has a 90% precision at 10% recall, but a 96% precision at 20% recall. There is really no tradeoff here: it simply makes more sense to use the classifier at 20% recall rather than at 10% recall, as you will get both higher recall and higher precision. So instead of looking at the precision at 10% recall, we should really be looking at the maximum precision that the classifier can offer with at least 10% recall. It would be 96%, not 90%. So one way to get a fair idea of the model's performance is to compute the maximum precision you can get with at least 0% recall, then 10% recall, 20%, and so on up to 100%, and then calculate the mean of these maximum precisions. This is called the Average Precision (AP) metric. Now when there are more than 2 classes, we can compute the AP for each class, and then compute the mean AP (mAP). That's it!
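To make this concrete, here is a small illustrative helper (an assumption, not the book's code) that computes this 11-point Average Precision from precision/recall values measured at a series of thresholds:

import numpy as np

def average_precision(precisions, recalls):
    precisions = np.asarray(precisions)
    recalls = np.asarray(recalls)
    max_precisions = []
    for min_recall in np.linspace(0, 1, 11):   # 0%, 10%, ..., 100%
        mask = recalls >= min_recall
        # maximum precision achievable with at least min_recall recall
        max_precisions.append(precisions[mask].max() if mask.any() else 0.0)
    return np.mean(max_precisions)

With the example above, the point at 20% recall (96% precision) would dominate the one at 10% recall (90% precision), exactly as described.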
However, in an object detection system, there is an additional level of complexity: what if the system detected the correct class, but at the wrong location (i.e., the bounding box is completely off)? Surely we should not count this as a positive prediction. So one approach is to define an IOU threshold: for example, we may consider that a prediction is correct only if the IOU is greater than, say, 0.5, and the predicted class is correct. The corresponding mAP is generally noted [email protected] (or mAP@50%, or sometimes just AP50). In some competitions (such as the Pascal VOC challenge), this is what is done. In others (such as the COCO competition), the mAP is computed for different IOU thresholds (0.50, 0.55, 0.60, …, 0.95), and the final metric is the mean of all these mAPs (noted AP@[.50:.95] or AP@[.50:0.05:.95]). Yes, that's a mean mean average.

Several YOLO implementations built using TensorFlow are available on GitHub, some with pretrained weights. At the time of writing, they are based on TensorFlow 1, but by the time you read this, TF 2 implementations will certainly be available. Moreover, other object detection models are available in the TensorFlow Models project, many with pretrained weights, and some have even been ported to TF Hub, making them extremely easy to use, such as SSD [29] and Faster R-CNN [30], which are both quite popular. SSD is also a "single shot" detection model, quite similar to YOLO, while Faster R-CNN is more complex: the image first goes through a CNN, and the output is passed to a Region Proposal Network (RPN) which proposes bounding boxes that are most likely to contain an object, and a classifier is run for each bounding box, based on the cropped output of the CNN. The choice of detection system depends on many factors: speed, accuracy, available pretrained models, training time, complexity, etc. The papers contain tables of metrics, but there is quite a lot of variability in the testing environments, and the technologies evolve so fast that it is difficult to make a fair comparison that will be useful for most people and remain valid for more than a few months.

[29] "SSD: Single Shot MultiBox Detector," Wei Liu et al. (2015).
[30] "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Shaoqing Ren et al. (2015).

Great! So we can locate objects by drawing bounding boxes around them. But perhaps you might want to be a bit more precise. Let's see how to go down to the pixel level.

Semantic Segmentation

In semantic segmentation, each pixel is classified according to the class of the object it belongs to (e.g., road, car, pedestrian, building, etc.), as shown in Figure 14-26. Note that different objects of the same class are not distinguished.
Several YOLO implementations built using TensorFlow are available on GitHub, some with pretrained weights. At the time of writing, they are based on TensorFlow 1, but by the time you read this, TF 2 implementations will certainly be available. Moreover, other object detection models are available in the TensorFlow Models project, many with pretrained weights, and some have even been ported to TF Hub, making them extremely easy to use, such as SSD29 and Faster R-CNN,30 which are both quite popular. SSD is also a "single shot" detection model, quite similar to YOLO. Faster R-CNN is more complex: the image first goes through a CNN, whose output is passed to a Region Proposal Network (RPN) that proposes bounding boxes most likely to contain an object; a classifier is then run for each proposed bounding box, based on the cropped output of the CNN.

29 "SSD: Single Shot MultiBox Detector," Wei Liu et al. (2015).
30 "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Shaoqing Ren et al. (2015).

The choice of detection system depends on many factors: speed, accuracy, available pretrained models, training time, complexity, etc. The papers contain tables of metrics, but there is quite a lot of variability in the testing environments, and the technologies evolve so fast that it is difficult to make a fair comparison that will be useful for most people and remain valid for more than a few months.

Great! So we can locate objects by drawing bounding boxes around them. But perhaps you want to be a bit more precise. Let's see how to go down to the pixel level.

Semantic Segmentation

In semantic segmentation, each pixel is classified according to the class of the object it belongs to (e.g., road, car, pedestrian, building, etc.), as shown in Figure 14-26. Note that different objects of the same class are not distinguished. For example, all the bicycles on the right side of the segmented image end up as one big lump of pixels. The main difficulty in this task is that when images go through a regular CNN, they gradually lose their spatial resolution (due to the layers with strides greater than 1): a regular CNN may end up knowing that there's a person somewhere in the bottom left of the image, but it will not be much more precise than that.

Figure 14-26. Semantic segmentation

Just like for object detection, there are many different approaches to tackle this problem, some quite complex. However, a fairly simple solution was proposed in the 2015 paper by Jonathan Long et al. we discussed earlier. They start by taking a pretrained CNN and turning it into an FCN, as discussed earlier. The CNN applies an overall stride of 32 to the input image (i.e., if you multiply all the strides greater than 1), meaning the last layer outputs feature maps that are 32 times smaller than the input image. This is clearly too coarse, so they add a single upsampling layer that multiplies the resolution by 32. There are several solutions available for upsampling (increasing the size of an image), such as bilinear interpolation, but that only works reasonably well up to ×4 or ×8. Instead, they used a transposed convolutional layer:31 it is equivalent to first stretching the image by inserting empty rows and columns (full of zeros), then performing a regular convolution (see Figure 14-27). Alternatively, some people prefer to think of it as a regular convolutional layer that uses fractional strides (e.g., 1/2 in Figure 14-27). The transposed convolutional layer can be initialized to perform something close to linear interpolation, but since it is a trainable layer, it will learn to do better during training.

31 This type of layer is sometimes referred to as a deconvolution layer, but it does not perform what mathematicians call a deconvolution, so this name should be avoided.

Figure 14-27. Upsampling using a transposed convolutional layer

In a transposed convolution layer, the stride defines how much the input will be stretched, not the size of the filter steps, so the larger the stride, the larger the output (unlike for convolutional layers or pooling layers).
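For example, here is what a Keras transposed convolutional layer looks like in practice. This is just a minimal sketch with illustrative shapes, showing that a stride of 2 roughly doubles the spatial dimensions of the feature maps:

```python
import numpy as np
from tensorflow import keras

# A coarse feature map, standing in for the output of a CNN:
# batch of 1, 7 x 10 spatial dimensions, 64 channels (illustrative values).
feature_maps = np.random.rand(1, 7, 10, 64).astype(np.float32)

# A transposed convolutional layer with stride 2: the input is stretched by
# inserting zeros between rows and columns, then a regular convolution is
# applied, roughly doubling the spatial resolution.
upsample = keras.layers.Conv2DTranspose(filters=32, kernel_size=3, strides=2,
                                        padding="same")
upsampled = upsample(feature_maps)
print(upsampled.shape)  # (1, 14, 20, 32)
```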
TensorFlow Convolution Operations

TensorFlow also offers a few other kinds of convolutional layers:

keras.layers.Conv1D creates a convolutional layer for 1D inputs, such as time series or text (sequences of letters or words), as we will see in ???.

keras.layers.Conv3D creates a convolutional layer for 3D inputs, such as 3D PET scans.

Setting the dilation_rate hyperparameter of any convolutional layer to a value of 2 or more creates an à-trous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1,2,3]] may be dilated with a dilation rate of 4, resulting in the dilated filter [[1, 0, 0, 0, 2, 0, 0, 0, 3]]. This allows the convolutional layer to have a larger receptive field at no computational cost and with no extra parameters.

tf.nn.depthwise_conv2d() can be used to create a depthwise convolutional layer (but you need to create the variables yourself). It applies every filter to every individual input channel independently. Thus, if there are fn filters and fn′ input channels, this will output fn × fn′ feature maps.

This solution is okay, but still too imprecise. To do better, the authors added skip connections from lower layers: for example, they upsampled the output image by a factor of 2 (instead of 32), and they added the output of a lower layer that had this double resolution. Then they upsampled the result by a factor of 16, leading to a total upsampling factor of 32 (see Figure 14-28). This recovered some of the spatial resolution that was lost in earlier pooling layers. In their best architecture, they used a second similar skip connection to recover even finer details from an even lower layer. In short, the output of the original CNN goes through the following extra steps: upscale ×2, add the output of a lower layer (of the appropriate scale), upscale ×2, add the output of an even lower layer, and finally upscale ×8. It is even possible to scale up beyond the size of the original image: this can be used to increase the resolution of an image, a technique called super-resolution.

Figure 14-28. Skip layers recover some spatial resolution from lower layers
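To illustrate the pattern (upscale, add a lower layer's output, repeat), here is a rough Keras sketch. It is not the authors' exact FCN architecture, just the skip-connection idea with a tiny made-up backbone; all layer sizes and the number of classes are assumptions:

```python
from tensorflow import keras

n_classes = 21  # e.g., 20 object classes + background (assumption)

inputs = keras.layers.Input(shape=(224, 224, 3))
x = keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)       # 1/2
x = keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)            # 1/4
feat8 = keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)       # 1/8
feat16 = keras.layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(feat8)  # 1/16
feat32 = keras.layers.Conv2D(512, 3, strides=2, padding="same", activation="relu")(feat16) # 1/32

# Coarse per-class score maps at 1/32 resolution, then: upscale x2, add the
# 1/16 feature maps, upscale x2, add the 1/8 feature maps, and upscale x8.
scores32 = keras.layers.Conv2D(n_classes, 1)(feat32)
up16 = keras.layers.Conv2DTranspose(n_classes, 3, strides=2, padding="same")(scores32)
up16 = keras.layers.Add()([up16, keras.layers.Conv2D(n_classes, 1)(feat16)])
up8 = keras.layers.Conv2DTranspose(n_classes, 3, strides=2, padding="same")(up16)
up8 = keras.layers.Add()([up8, keras.layers.Conv2D(n_classes, 1)(feat8)])
outputs = keras.layers.Conv2DTranspose(n_classes, 3, strides=8, padding="same",
                                       activation="softmax")(up8)  # back to full size
model = keras.Model(inputs=[inputs], outputs=[outputs])
model.summary()
```

Each Add layer requires the upsampled maps and the lower-level maps to have the same shape, which is why both are first projected to n_classes channels using 1 × 1 convolutions.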
Once again, many GitHub repositories provide TensorFlow implementations of semantic segmentation (TensorFlow 1 for now), and you will even find a pretrained instance segmentation model in the TensorFlow Models project. Instance segmentation is similar to semantic segmentation, but instead of merging all objects of the same class into one big lump, each object is distinguished from the others (e.g., it identifies each individual bicycle). At present, the project provides multiple implementations of the Mask R-CNN architecture, which was proposed in a 2017 paper: it extends the Faster R-CNN model by additionally producing a pixel mask for each bounding box. So not only do you get a bounding box around each object, with a set of estimated class probabilities, you also get a pixel mask that locates the pixels in the bounding box that belong to the object.

As you can see, the field of Deep Computer Vision is vast and moving fast, with all sorts of architectures popping up every year, all based on Convolutional Neural Networks. The progress made in just a few years has been astounding, and researchers are now focusing on harder and harder problems, such as adversarial learning (which attempts to make the network more resistant to images designed to fool it), explainability (understanding why the network makes a specific classification), realistic image generation (which we will come back to in ???), single-shot learning (a system that can recognize an object after it has seen it just once), and much more. Some even explore completely novel architectures, such as Geoffrey Hinton's capsule networks32 (I presented them in a couple of videos, with the corresponding code in a notebook).

32 "Matrix Capsules with EM Routing," G. Hinton, S. Sabour, N. Frosst (2018).

Now on to the next chapter, where we will look at how to process sequential data such as time series using Recurrent Neural Networks and Convolutional Neural Networks.

Exercises

1. What are the advantages of a CNN over a fully connected DNN for image classification?
2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?
3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?
4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?
5. When would you want to add a local response normalization layer?
6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet, and Xception?
7. What is a Fully Convolutional Network? How can you convert a dense layer into a convolutional layer?
8. What is the main technical difficulty of semantic segmentation?
9. Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.
10. Use transfer learning for large image classification.
   a. Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can just use an existing dataset (e.g., from TensorFlow Datasets).
   b. Split it into a training set, a validation set, and a test set.
   c. Build the input pipeline, including the appropriate preprocessing operations, and optionally add data augmentation.
   d. Fine-tune a pretrained model on this dataset.
11. Go through TensorFlow's DeepDream tutorial. It is a fun way to familiarize yourself with various ways of visualizing the patterns learned by a CNN, and to generate art using Deep Learning.

Solutions to these exercises are available in ???.

About the Author

Aurélien Géron is a Machine Learning consultant. A former Googler, he led the YouTube video classification team from 2013 to 2016. He was also a founder and CTO of Wifirst from 2002 to 2012, a leading Wireless ISP in France; and a founder and CTO of Polyconseil in 2001, the firm that now manages the electric car sharing service Autolib'.

Before this he worked as an engineer in a variety of domains: finance (JP Morgan and Société Générale), defense (Canada's DOD), and healthcare (blood transfusion). He published a few technical books (on C++, WiFi, and internet architectures), and was a Computer Science lecturer in a French engineering school.

A few fun facts: he taught his three children to count in binary with their fingers (up to 1023), he studied microbiology and evolutionary genetics before going into software engineering, and his parachute didn't open on the second jump.

Colophon

The animal on the cover of Hands-On Machine Learning with Scikit-Learn and TensorFlow is the fire salamander (Salamandra salamandra), an amphibian found across most of Europe. Its black, glossy skin features large yellow spots on the head and back, signaling the presence of alkaloid toxins. This is a possible source of this amphibian's common name: contact with these toxins (which they can also spray short distances) causes convulsions and hyperventilation.
Either the painful poisons or the moistness of the salamander’s skin (or both) led to a misguided belief that these creatures not only could survive being placed in fire but could extinguish it as well. Fire salamanders live in shaded forests, hiding in moist crevices and under logs near
