Deep Learning for MSc Lecture 1 - Introduction
Document Details
Kevin Bryson
Summary
This document presents an introductory lecture on deep learning. It traces the historical evolution of the field and explores developments in algorithms, hardware, and data engineering, alongside the ethical considerations of deep learning. The document also covers generative adversarial networks and style transfer, with an emphasis on practical applications and implications.
Full Transcript
Deep Learning for MSc Lecture 1 - Introduction
KEVIN BRYSON

A brief history of Deep Learning
▪ Artificial neurons (McCulloch-Pitts, 1943)
▪ Perceptron learning machine (Rosenblatt, 1958)
▪ Human visual cortex (Hubel-Wiesel, 1959)
▪ Neocognitron (Fukushima, 1980)
▪ Back-propagation to train ANNs (Rumelhart, Hinton, Williams, 1986)
▪ LeNet, the first CNN (Yann LeCun et al., 1989)
▪ AlexNet (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, 2012)
▪ Atari game playing using deep reinforcement learning (2013)
▪ Generative Adversarial Networks (GANs) presented (Goodfellow et al., 2014)
▪ AlphaGo beats world champion (2016)
▪ Google Brain introduces the transformer model with "Attention is All You Need" (2017)
▪ AlphaFold beats all the experts at protein folding in CASP13 (2019)
▪ OpenAI releases the large language model ChatGPT (2022)

So it's not new - why the fuss now?
❑ Deep learning is a generalisation of a neural network as a graph of tensor operators, using
  ❑ The chain rule (aka "back-propagation")
  ❑ Stochastic gradient descent
  ❑ Convolutions
  ❑ Parallel processing on GPUs
  ❑ Big data
  (A minimal code sketch of this tensor-operator view appears at the end of this section.)
❑ So at first glance, not much difference from the 90s...

High performance for ImageNet recognition
(Image from Stanford cs231n notes)

Impact in lots of application areas (a number not possible with traditional ML)
Near human-level image processing and classification
◦ Segmentation, labelling, classification
Handwriting transcription
Audio classification
◦ E.g. near human-level speech recognition (in some contexts)
◦ Music information retrieval
Natural language processing
◦ Automatic machine translation beating traditional approaches
Realistic generation of images and videos (DeepFakes)
Superhuman performance at games such as Go, Chess, StarCraft, …
Widely used in Facebook, Google, Amazon, Spotify
Controlling life-critical systems like semi-autonomous cars
◦ Safer than human drivers in many contexts, not yet feasible in more challenging contexts
Science, including eye disease diagnosis, cancer diagnosis, protein structure prediction, controlling fusion reactors, etc.

Tesla
Video of lecture by Tesla head of AI: https://vimeo.com/272696002?cjevent=575e40b30d2b11e9802e06840a180514 and related blog: https://medium.com/@karpathy/software-2-0-a64152b37c35
(Images & video from www.tesla.com)

Important People
Does anyone know who this important person is in terms of deep learning research? Possibly one of the fathers of deep learning: Geoffrey Hinton?

Generative Adversarial Networks & Style transfer
Image Style Transfer Using Convolutional Neural Networks, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge; IEEE Conf. on Computer Vision & Pattern Recognition (CVPR), 2016, pp. 2414-2423

Generative Adversarial Networks & Style transfer
https://www.youtube.com/watch?v=Khuj4ASldmU&feature=youtu.be

Generating imaginary worlds…
https://www.youtube.com/watch?v=5zlcXTCpQqM

DeepMind: Treatment of Eye Disease
Nature paper

AlphaGo
Nature paper, Mastering the game of Go without human knowledge

WaveNet: Speech from text
https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

AlphaFold: Solving the protein folding problem
https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology

ChatGPT … sometimes just wants to agree with you … even if you are wrong …
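The "why the fuss now?" slide above describes deep learning as a graph of tensor operators trained with the chain rule (back-propagation) and stochastic gradient descent. As a minimal, hypothetical sketch of that view, the PyTorch snippet below (PyTorch is one of the libraries named later in this lecture) builds a tiny operator graph and runs a few SGD steps on random toy data; the layer sizes, learning rate and data are illustrative assumptions, not material from the lecture.

```python
import torch
import torch.nn as nn

# A tiny "graph of tensor operators": linear -> ReLU -> linear.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent

x = torch.randn(64, 4)            # toy inputs (purely illustrative)
y = torch.randint(0, 2, (64,))    # toy class labels

for step in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass through the operator graph
    loss.backward()               # chain rule / back-propagation
    optimiser.step()              # one SGD parameter update
```

The same three elements (a differentiable graph, a loss, and a gradient-based update) recur in every framework mentioned in these notes.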
Why is Deep Learning so successful?
❑ The success of deep learning is multi-factorial:
  ❑ Algorithms: five decades of research in machine learning,
  ❑ Hardware: CPUs/GPUs/storage developed for other purposes,
  ❑ Tools: tools and a culture of collaborative and reproducible science,
  ❑ Data: lots of data readily available from "the Internet", social networks, autonomous cars,
  ❑ Money: resources and efforts from large corporations; thousands of researchers and developers attracted to the area.

Algorithms & ML progress
Five decades of research in ML provided:
❑ A taxonomy of ML concepts (classification, generative models, clustering, kernels, linear embeddings, etc.),
❑ A sound statistical formalization (Bayesian estimation),
❑ A clear picture of fundamental issues (bias/variance dilemma, VC dimension, generalization bounds, etc.),
❑ A good understanding of optimization issues,
❑ Efficient large-scale algorithms.

New developments since the 1980s: Algorithms
❑ Convolutional networks
❑ Pooling
❑ Regularisation methods (e.g. batch normalisation)
❑ Specific efficient activation functions
❑ Better weight initialisation techniques
❑ Some improvements in optimisation techniques.
Once this collection of developments began to allow networks with more than 10 layers, we started to see deep learning improve its relative performance. Now networks of thousands of layers can be trained. (These ingredients are combined in the code sketch at the end of this section.)

New Developments: Hardware
Computational frameworks
◦ CPUs in 2010 were 5000 times faster than in 1990
◦ NVIDIA launched CUDA in 2007 – a programming interface for their GPUs
◦ GPUs give a factor-10 improvement over powerful CPUs
◦ Custom-designed hardware such as Google's TPUs will increase this by another order of magnitude, with significant power savings
More memory, more disk space, larger networks

New Developments: Collaboration, Software & Data Engineering
Better open-source development environments
◦ Theano, TensorFlow
◦ Python, NumPy
◦ Keras
◦ PyTorch
Faster sharing of scientific papers via the arXiv pre-print server
Peer pressure on authors to share working implementations and data
More training data available in many areas (e.g. image processing)
◦ Deep learning models are complex with many parameters, and do relatively better as the amount of data available increases.
◦ Investment in creating and annotating large-scale training sets such as ImageNet
Open, competitive environments such as Kaggle allow engineers to push the envelope, and companies can gauge their performance against a wider range of participants.

New Developments: Data
Major effort has been put into creating curated data sets to help train networks for different tasks.
MNIST – for handwritten digit recognition
ImageNet – for image understanding

MNIST scanned handwritten digits
http://yann.lecun.com/exdb/mnist/

ImageNet
http://image-net.org

From a practical perspective, deep learning:
❑ Makes the design of large learning architectures a system/software development task,
❑ Makes it possible to leverage modern hardware (clusters of GPUs),
❑ Does not plateau when using more data,
❑ Makes large pre-trained networks a commodity.

Ethics, morality and robustness
Is what we are building ethically acceptable?
How sure are we that it does what we have trained it to do? (Particularly if it is safety critical.)
How can we be sure that it will not be misused to the detriment of society?
Bloomberg Quicktake - DeepFakes: https://www.youtube.com/watch?v=gLoI9hAX9dw
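To make the "new developments" listed above concrete, here is a minimal, hypothetical sketch using PyTorch and torchvision (tools named on the slides). It loads the curated MNIST data set and assembles a small network from the listed ingredients: convolutions, pooling, batch normalisation, ReLU activations and He (Kaiming) weight initialisation. The particular architecture and sizes are illustrative assumptions, not taken from the lecture.

```python
import torch.nn as nn
from torchvision import datasets, transforms

# MNIST: a curated data set of scanned handwritten digits (28x28 greyscale).
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())

# A small CNN built from the post-1980s ingredients listed above.
net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.BatchNorm2d(16),                           # batch normalisation
    nn.ReLU(),                                    # efficient activation function
    nn.MaxPool2d(2),                              # pooling: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # 10 digit classes
)

# Better weight initialisation: He/Kaiming initialisation suited to ReLU layers.
for m in net.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
```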
What is Deep Learning?
"Learning hierarchical representations from large amounts of data"
❑ High performance: provides viable solutions to hitherto unsolvable problems
❑ High computational cost! Requires dedicated hardware
❑ Dependent on optimised libraries (PyTorch, TensorFlow)
❑ Hard to interpret!

How does it relate to AI?
(Goodfellow et al., 2016)

Starting from traditional Machine Learning
[Slide diagrams: an input image is fed to a model that answers "Dog?" or "Cat?"; in machine learning the model is tunable, f(x; θ), trained on labelled Dog/Cat examples by minimising a loss L[f(x; θ), y]; a further variant inserts a hand-crafted feature-extraction stage between the input and the tunable model.]

What are features?
"Features" are a way to describe relevant configurations of observable variables:
◦ One example would be PCA or ICA
◦ In computer vision, edges and contrasts are often used (SIFT)
◦ In sound processing, frequencies via Fourier components (MFCC)
◦ In text, word embeddings
◦ Etc…
They are usually engineered from domain knowledge, or optimised using statistical criteria.
(Images from https://doi.org/10.1371/journal.pone.0092137.g003, https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd and http://jalammar.github.io/illustrated-word2vec/)

Deep vs shallow architectures
[Slide diagrams: a shallow architecture, with hand-crafted features feeding a tunable model f(x; θ), versus a deep architecture in which the feature-extracting layers are themselves tunable; both are trained against the loss L[f(x; θ), y].]
◦ Each layer learns a more abstract representation of the data.
◦ More abstract concepts are usually more invariant to local changes.
◦ The last layer is a classical neural network, but its task is easier because it works on a better representation.
◦ All representations are learnt "end-to-end" from the objective function (i.e., minimise a loss).

Learning representations from data
One reason for the success of deep learning has been the automation of what was a crucial step in the ML workflow: feature engineering.
ML models transform input data into, hopefully, meaningful outputs by training on examples of inputs and outputs.
Central problem in ML: meaningfully transform data.
◦ Learn useful representations of the input data that get us closer to the expected outputs, which make the task at hand easier.

Changing the representation makes the task easier (Chollet, 2018)

Representations matter
Could a single linear threshold separate these groups of data?
(Figures from https://www.deeplearningbook.org/slides/01_intro.pdf)

What are the limits of a linear classifier?
If the data cannot be separated by a hyperplane, then a linear classifier cannot solve the problem in this space.
The probability that P randomly labelled samples of dimension N can be separated by a hyperplane goes to zero very rapidly as P increases.
(Image from http://www.cs.uccs.edu/~jkalita/work/cs587/2014/03SimpleNets.pdf)

Cover's Theorem
Cover's Theorem states that, given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation.
So should we just find some mapping that expands our feature space to a higher-dimensional one, where it will be easier to find a separating function? (A small code illustration follows below.)
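As a small illustration of the "representations matter", linear classifier and Cover's Theorem slides above, the hypothetical NumPy sketch below (NumPy is mentioned earlier in these notes) generates two classes lying on concentric rings. No single linear threshold on the raw (x1, x2) coordinates separates them, but the simple non-linear change of representation to the radius r = sqrt(x1^2 + x2^2) makes one threshold sufficient. The data-generating radii and the threshold value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two classes on concentric rings: not separable by any single line in (x1, x2).
angles = rng.uniform(0.0, 2.0 * np.pi, n)
radii = np.concatenate([rng.uniform(0.0, 1.0, n // 2),   # class 0: inner disc
                        rng.uniform(2.0, 3.0, n // 2)])  # class 1: outer ring
x = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

# Non-linear change of representation: map each point to its radius.
r = np.linalg.norm(x, axis=1)

# In this new one-dimensional representation a single threshold (a hyperplane
# in 1D) separates the classes perfectly, as Cover's Theorem suggests.
pred = (r > 1.5).astype(float)
print("accuracy after the transformation:", (pred == y).mean())  # 1.0
```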
Flat vs Deep
A neural network with a single hidden layer that is wide enough can represent any function (Cybenko, 1989), but not necessarily learn it.
◦ Certain functions, like parity, may require exponentially many hidden units (in the number of inputs).
Deep networks with multiple hidden layers may be exponentially more efficient.
Shallow nets might overfit more.

Multidigit number transcription (Goodfellow, 2014)
(Figure slides from Chollet, 2018)

3 design questions
1. What architecture? (layers, size, type)
2. What loss function?
3. What optimisation method?
(A minimal code sketch at the very end of these notes shows where each choice appears.)

Reading / Watching
Re-read your machine learning course notes to remind yourself of the major topics.
Look through some of the video examples / web pages contained within these lecture notes.
Read Chapter 1 of Goodfellow's Deep Learning book: https://www.deeplearningbook.org/contents/intro.html
Read some of the examples of applications of deep learning from Chapter 13 of Drori's The Science of Deep Learning book.

Questions?
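Referring back to the "3 design questions" slide, the sketch below shows, as a hypothetical PyTorch example, where each of the three choices (architecture, loss function, optimisation method) appears in code. The specific layer sizes, loss and optimiser are illustrative assumptions, not recommendations from the lecture.

```python
import torch
import torch.nn as nn

# 1. What architecture? (layers, size, type) - here a small fully connected net
#    for 28x28 images flattened to 784 features.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# 2. What loss function? - cross-entropy for a 10-class classification problem.
loss_fn = nn.CrossEntropyLoss()

# 3. What optimisation method? - SGD with momentum (Adam is a common alternative).
optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```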