generalization-and-features.pdf
Generalization
Prepared by: Joseph Bakarji

[Figure: concept map relating the seen and unseen world to data and models via measurements, compression, learning, modeling, inference, projection, and reconstruction.]

Given a new input, what's the output?

    Input          Output
    x^{(1)}        y^{(1)}
    x^{(2)}        y^{(2)}
    x^{(3)}        y^{(3)}
    x^{(4)}        y^{(4)}
    ...            ...
    x_q (query)    ? (prediction y_q)

Linear Regression

1. Assume a linear hypothesis:
       h_\theta(x) = \theta^\top x = \sum_{i=0}^{d} \theta_i x_i
2. Cost function:
       J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
3. Minimize, e.g. with gradient descent:
       for t = 1, ..., T:   \theta_i := \theta_i - \alpha \, \partial J(\theta) / \partial \theta_i
   or with stochastic gradient descent (SGD):
       for i = 1, ..., n:   \theta := \theta - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
4. Optimal predictor: \hat{y} = h_{\hat{\theta}}(x)
5. Predict unseen data: y_{pred} = h_{\hat{\theta}}(x_{new})

(A runnable sketch of this pipeline appears after these slides.)

Given a new input, what's the output?

Given the data, find a function h, also called a hypothesis, that predicts an output y_q given an input x_q.

[Figures: the same data set fitted by linear interpolation, by polynomial interpolation, and by some other function.]

How do you choose the hypothesis?

What happens if we have more inputs?

Assume a linear hypothesis over inputs x_1, x_2, ... with output y:

    Inputs                    Output
    x_1^{(1)}  x_2^{(1)}      y^{(1)}
    x_1^{(2)}  x_2^{(2)}      y^{(2)}
    x_1^{(3)}  x_2^{(3)}      y^{(3)}
    x_1^{(4)}  x_2^{(4)}      y^{(4)}
    ...        ...            ...

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ...
                = [\theta_0, \theta_1, \theta_2, \theta_3, ...] \cdot [1, x_1, x_2, x_3, ...]    (weights · inputs)
                = \theta \cdot x = \theta^\top x

Feature Engineering

h does not have to be linear in x. Example: construct a polynomial model:

    h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + ...
                = [\theta_0, \theta_1, \theta_2, \theta_3, ...] \cdot [1, x, x^2, x^3, ...]
                = \theta^\top \phi(x) = \theta_0 \phi_0(x) + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + ...

where \phi(x) is the feature map.

A feature map can also drop features. Example:

    h_\theta(x) = \theta_1 x + \theta_3 x^3 = [\theta_1, \theta_3] \cdot [x, x^3] = \theta^\top \phi(x)

How to choose \phi(x)? How to optimize over \phi(x)?

Underfitting, Just Right, Overfitting

[Figure: three fits of the same data.]
- Underfitting (high bias): \phi(x) = [1, x]
- Just right: \phi(x) = [1, x, x^2]
- Overfitting (high variance): \phi(x) = [1, x, x^2, x^3, ...]

How can we tell if \phi(\cdot) is good?

The purpose of machine learning is to generalize to unseen data. Create a test set (a hold-out set) to evaluate the model.

[Figure: with a hold-out set, \phi(x) = [1, x] gives a large loss on the test set, while \phi(x) = [1, x, x^2] gives a small loss on the test set.]
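The five-step pipeline above, combined with a polynomial feature map, fits in a few lines of NumPy. The following is a minimal sketch, not code from the slides: the names phi, h, J, and gradient_descent, the cubic feature map, the synthetic sine data, and the step size alpha = 0.01 are all illustrative assumptions.

    import numpy as np

    # Polynomial feature map: phi(x) = [1, x, x^2, ..., x^degree]
    def phi(x, degree=3):
        return np.vander(x, degree + 1, increasing=True)

    # Hypothesis, linear in the features: h_theta(x) = theta^T phi(x)
    def h(theta, X):
        return X @ theta

    # Squared-error cost: J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
    def J(theta, X, y):
        return 0.5 * np.sum((h(theta, X) - y) ** 2)

    # Batch gradient descent: theta := theta - alpha * dJ/dtheta
    def gradient_descent(X, y, alpha=0.01, T=20000):
        theta = np.zeros(X.shape[1])
        for _ in range(T):
            theta -= alpha * (X.T @ (h(theta, X) - y))  # dJ/dtheta
        return theta

    # Fit noisy synthetic data, then predict at a query point x_q
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 20)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
    X = phi(x)
    theta_hat = gradient_descent(X, y)
    print(f"final training cost J = {J(theta_hat, X, y):.4f}")
    y_pred = h(theta_hat, phi(np.array([0.5])))
    print("prediction at x_q = 0.5:", y_pred[0])

Because the hypothesis is linear in the features, changing phi swaps the hypothesis class (linear, polynomial, or something else) without touching the cost or the optimizer, which is the point of writing h_\theta(x) = \theta^\top \phi(x).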
How do we tell that \phi(\cdot) is good?

Split the data into a training set of size d_{tr} and a test set of size d_{tt}, and define an objective function for each subset:

    J_{train}(\theta) = \frac{1}{2 d_{tr}} \sum_{i=1}^{d_{tr}} \left( \theta^\top \phi(x^{(i)}) - y^{(i)} \right)^2

    J_{test}(\theta) = \frac{1}{2 d_{tt}} \sum_{i=1}^{d_{tt}} \left( \theta^\top \phi(x^{(i)}) - y^{(i)} \right)^2

Bias-Variance Trade-off

[Figure: error as a function of complexity. Loss vs. number of parameters, from \phi(x) = [1, x] up to \phi(x) = [1, x, x^2, x^3, ...]: J_{train}(\theta) keeps decreasing with complexity, while J_{test}(\theta) reaches a minimum and then grows.]

Other Hyperparameters

\phi is not the only unknown over which we want to optimize:
- T: number of epochs
- \eta: step size
- \phi: feature vector

Optimize over \phi and the other hyperparameters: visualize the training set, choose a feature vector \phi(x), and split the data into training and test sets. (A numerical sketch of this loop follows.)
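To make the train/test recipe concrete, here is a minimal sketch that sweeps the polynomial degree of \phi and reports J_train and J_test for each, reproducing the trade-off curve numerically. The synthetic sine data, the 20/10 split, and the use of least squares in place of gradient descent as the minimizer of J_train are assumptions made for brevity.

    import numpy as np

    # Feature map: phi(x) = [1, x, ..., x^degree]
    def phi(x, degree):
        return np.vander(x, degree + 1, increasing=True)

    # Mean squared-error objective, as on the slide:
    # J(theta) = 1/(2m) * sum_i (theta^T phi(x_i) - y_i)^2
    def cost(theta, X, y):
        return 0.5 * np.mean((X @ theta - y) ** 2)

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, 30)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

    # Split the data into a training set and a held-out test set
    idx = rng.permutation(x.size)
    tr, tt = idx[:20], idx[20:]

    for degree in range(1, 10):
        # Least squares stands in for the gradient-descent minimizer of J_train
        theta, *_ = np.linalg.lstsq(phi(x[tr], degree), y[tr], rcond=None)
        print(f"degree={degree}  "
              f"J_train={cost(theta, phi(x[tr], degree), y[tr]):.4f}  "
              f"J_test={cost(theta, phi(x[tt], degree), y[tt]):.4f}")

With this split, J_train typically keeps shrinking as the degree grows, while J_test bottoms out at a moderate degree and then climbs, which is the trade-off sketched in the figure above.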