AutoMLWS23_merged_compressed PDF
Document Details
Uploaded by FondDalmatianJasper
Technische Universität Wien
Nysret Musliu
Tags
Summary
This document discusses automated machine learning (AutoML), focusing on algorithm selection and hyperparameter optimization. It covers different machine learning algorithms, how to evaluate them, and the challenges in choosing the best one for a specific task. The document explores the concept of metalearning, which involves learning from past learning experiences. It also describes how to create training metadata and computational cost associated with feature extraction and selection.
Full Transcript
................................................. 194.025: Introduction to Machine Learning Automated Machine Learning Nysret Musliu Databases and Artificial Intelligence Group Outli...
................................................. 194.025: Introduction to Machine Learning Automated Machine Learning Nysret Musliu Databases and Artificial Intelligence Group Outline................................................. Motivation and problem definition – Algorithm selection – Hyperparameter optimization Automated algorithm selection Hyperparameter optimization Automating machine learning pipeline AutoML systems Motivation: Algorithm Selection................................................. Different machine learning algorithms available – k-nn – Decision trees – Random forest – Bayesian networks – Support vector machines – Neural networks – … k-nn NN DT … Algorithm Apply the selected Data Set … Selection algorithm No Free Lunch (NFL) theorem................................................. “…for any algorithm, any elevated performance over one class of problems is offset by performance over another class” “any two algorithms are equivalent when their performance is averaged across all possible problems“ Implications of NFL theorem in machine learning – How to evaluate a new classification/regression algorithm? – Which algorithm to use for a new classification/regression task? David Wolpert, William G. Macready: No free lunch theorems for optimization. IEEE Transac. Evolutionary Computation 1(1): 67-82 (1997) Wolpert, D.H., and Macready, W.G. (2005) "Coevolutionary free lunches," IEEE Transac. on Evolutionary Computation, 9(6): 721-735 Motivation: Hyperparameter optimization................................................. Machine learning algorithms have different hyperparameters – k-nn: number of neighbors, distance metric… – Neural networks: number of layers, activation functions… – Random forest: number of trees, number of featues…. – … Configuration/Setting of parameters often very important to obtain good results – Discrete parameters – Continuous parameters – Conditional parameters Large search space of possible configurations How to select the best values for hyperparameters? Hyperparameter optimization................................................. Formal problem definition [[H2015]]: A machine learning algorithm A Parameters: λ1, …, λn Domains: Λ1,… Λn Hyperparameter space: Λ = Λ1 x… x Λn A λ : algorithm A uses hyperparameter setting λ L(A λ, Dtrain, Dvalid): Validation loss (e.g. misclassification rate) Optimization problem under k-fold cross-validations is to minimize the black box function below: Algorithm selection and hyperparameter optimization in practice................................................. Users usually – make a pre-selection of some techniques – experiment with different variants and hyperparameters – experiment with different pre-processing steps – select the best technique for a particular application Disadvantages – this approach requires usually much time – learning systems are not flexible for new applications Algorithm selection/configuration problem! AutoML................................................. Process of automating of machine learning when applied to a data set Automated optimization of hyperparameters Automated algorithm selection Automated feature selection, preprocessing… Algorithm selection in machine learning (Metalearning)................................................. Learning about learning Learning -> accumulating experience on a specific learning task (e.g. different applications from Machine Learning Repository) Metalearning -> accumulating experience on the performance of the machine learning algorithms in multiple applications – Deals with many machine learning techniques – Dynamical model selection method combination Algorithm selection with Rice’s framework................................................. Figure taken from John R. Rice: The Algorithm Selection Problem. Advances in Computers 15: 65-118 (1976) Kate Smith-Miles: Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv. 41(1): (2008) Algorithm selection................................................. Input (see and ): Problem space P that represents the set of instances of a problem class A feature space F that contains measurable characteristics of the instances generated by a computational feature extraction process applied to P Set A of all considered algorithms for tackling the problem The performance space Y represents the mapping of each algorithm to a set of performance metrics Problem: For a given problem instance x E P, with features f(x) E F, find the selection mapping S(f(x)) into algorithm space , such that the selected algorithm a E A maximizes the performance mapping y(a(x)) E Y John R. Rice: The Algorithm Selection Problem. Advances in Computers 15: 65-118 (1976) Kate Smith-Miles: Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv. 41(1): (2008) Algorithm selection issues................................................. How to select – f – S – y In Metalearning A -> set of base-level learning algorithms S -> is also a learning algorithm Important – Construction of training data – The content of A – The computational cost of f and S Algorithm Selection vs. Model Combination................................................. Model combination – Creating of a single learning system from different learning algorithms The aim is to reduce the probability of misclassification by combining different techniques E.g ensemble methods – Is also considered as metalearning The content of algorithm space A................................................. According to NFL theorem no learning algorithm is the best in all applications Some learners perform better in some domains Select complementary base learner algorithms for A Ideally the smallest set of algorithms should be selected that give a good coverage of space of all learning algorithms In practice is hard to have a good coverage with minimal set of learning algorithms The content of algorithm space A................................................. Base learners should have different biases Representatives from different model classes (e.g. Bayesian learning, Decision tree, …) Experts usually – select a model class based on characteristics (attribute type,..) of the task – select the most promising algorithm from the selected class Training Metadata................................................. Data about problems If the number of real world classification tasks is small – Increase the tasks with artificial generated problems – Algorithm selection is inherently incremental Meta examples – , t(x) -> target value for x t(x) = argmmaxa ∊ A y(a, x) Meta-learning takes { : x ∊ P‘ ⊆P} as training set – Predicts the best algorithm for a new classification task Training Metadata: Example................................................. Nr. features Nr. instances … Best algorithm Dataset1 5 600 … … Decision Tree Dataset2 3 200 … … Naïve Bayes Dataset3 25 20000 … … Neural Networks Dataset4 56 4500 … … Neural Networks DataSet5 8 1000 … …. Random Forest … … … … … … S(f(x)): Learn a model (e.g. decision tree…) Classify Nr. features Nr. instances … Best algorithm NewDataSet1 20 2000 ? NewDataSet2 500 400000 … ? Feature space................................................. Very important to choose appropriate features Features must have some predictive power Different characterization – Statistical and Information-theoretic – Model-based – Landmarking – … Statistical and Information-theoretic features................................................. Different features from data set are extracted – Number of attributes – Number of classes – Ratio of examples to attributes – Average class entropy – Degree of correlation between features and target concept – … Assumption: Learning algorithms are sensitive to the structure of the data set The size of the training set also plays an important role Model-based characterization................................................. Properties of hypothesis induced on a particular problem are used as an indirect form of characterization Decision trees have been considered – f(x) consist of Properties of the tree: – Nodes per feature – Maximum tree depth – Tree imbalance – … … Landmarking (I)................................................. Each learner has the class of tasks in which performs very well (area of expertise of a learner) Performance of the algorithm on a task says something about the nature of the task Landmarker (landmark learner) -> a learning mechanism whose performance is used to describe a task Locate the task in the expertise space Expertise map is used by landmarking as the main source of the information Landmarking (II)................................................. Suppose that i1, i2, i3 are taken as landmarkers One possible conclusion – Problems on which i1 and i3 perform well, and i2 poor are likely to be in the expertise of i4 Landmarking is simple: learners are used to signpost learners We want landmarkers to be efficient The cost of running landmarkers should be smaller that the cost of running all learners in A -> otherwise we could run all learners and find which is the best Use landmarkers that are more efficient -> e.g. use naive algorithms (naive bayes, OneR, …) Figure taken from metalearning tutorial Landmarking has given good results given to the end of these slides Computational cost of f and S................................................. The cost of computation of f(x) should be much lower than the cost of computing t(x) Regarding S – Cost of inducing the metamodel Depends from the chosen regime – Cost for making prediction with the metamodel Usually is not problematic Selecting y................................................. Predicting accuracy main criterion for algorithm selection Other performance measures – Computational complexity – Compactness – Expressiveness – … Another possibility is the ranking of algorithms – For every new problem algorithms are ranked by decreasing performance Hyperparameter optimization................................................. Grid search – Exhaustive search – Continuous parameters have to be discretized – Cartesian product of sets (values of parameters) – Cross-validation is usually used for evaluation Randomized search – Random selection of configurations in the search space – Can outperform grid search – Specifying the distribution from which to sample Sequential Model-based Bayesian Optimization................................................. General framework for minimizing blackbox functions f A probabilistic model M is used to model f based – on point evaluation of f – available prior information Applies M to select promising inputs to evaluate f next M is a regression model fitted on data tuples (λ, L(Aλ)) Selection of next hyperparameter configuration λ : – Acquisition function: aM: Λ -> R – Most useful configuration λ is evaluated next – Most popular acquisition function is expected improvement over the best previously-observed function value fmin - Probabilistic model M is used to predict a posterior distribution over function values pM(f | λ) Bayesian Optimization.................................................................................................. Source: [H2015] Automating supervised machine learning pipeline................................................. Source: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. https://dl.acm.org/citation.cfm?id=2908918 Metalearning and hyperparameter optimization................................................. Hyperparameter optimization is a special case of model selection Selecting a specific learning algorithm can be viewed as optimizing a nominal hyperparameter Preprocessing steps such as data normalization can be treated as nominal hyperparameters Use meta-features for initialization of the parameters of the single dataset AutoML Systems................................................. Auto-Sklearn (Chapter 6 in [H2018] ): https://automl.github.io/auto-sklearn/master/ Auto-WEKA (see Chapter 4 in [H2018]): https://www.cs.ubc.ca/labs/algorithms/Projects/autoweka/ TPOT ([O2016]): http://epistasislab.github.io/tpot/ H2O: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html Auto-PyTorch: https://github.com/automl/Auto-PyTorch … Example of parameter optimization: SMAC [H2011]................................................. Can be used for optimization of parameters for arbitrary algorithms Based on set of instances Optimization of hard combinatorial problem solvers Hyperparameter optimization of machine learning algorithms Uses random forest to model pM(f | λ) Fitting a regression tree to data................................................. Source: [H2011]................................................. Auto-Sklearn ([F2017])................................................. Literature................................................. [H2018] AutoML: Methods, Systems, Challenges (new book). Editors: Frank Hutter, Lars Kotthoff, Joaquin Vanschoren. https://www.automl.org/book/ [H2015] Frank Hutter, Jörg Lücke, Lars Schmidt-Thieme: Beyond Manual Tuning of Hyperparameters. KI 29(4): 329-337 (2015) [H2011] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown: Sequential Model-Based Optimization for General Algorithm Configuration. LION 2011: 507-523 [O2016] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, Jason H. Moore: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. GECCO 2016: 485-492: https://dl.acm.org/citation.cfm?id=2908918 [F2017] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter: Efficient and Robust Automated Machine Learning. NIPS 2015: 2962-2970 Literature: Metalearning and algorithm selection................................................. [G2008] Christophe Giraud-Carrier. Metalearning - A Tutorial, ICMLA 2008 – http://www.icmla-conference.org/icmla08/slides2.pdf [S2008] Kate Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv. 41(1): (2008 Bias and Fairness in AI (Assistant Prof.) Stefan Neumann, Introduction to Machine Learning, 22.01.2025 1 On the Real-World Impact of ML Systems ML systems (and the like) are regularly used to make decisions a ecting people Sometimes this impact is obvious, sometimes it is less so Today we will talk about one obvious example and less obvious example We will see that avoiding these negative impacts can be surprisingly di cult Simplifying assumptions today: We only consider binary classi cation with labels 0/1 (predicting Y ̂ ∈ {0,1}) and two protected groups. 2 fi ff ffi Who of you has heard about COMPAS? 3 COMPAS COMPAS = Correctional O ender Management Pro ling for Alternative Sanctions Algorithm to assess potential recidivism risk Recidivism (“Rückfälligkeit”): the tendency of a convicted criminal to reo end Used for di erent decisions: pretrial, sentencing, parole, probation Used in the U.S. by judges and in prisons for at least 9 states 4 ff ff fi ff Error Rate of COMPAS People looked at error rate of COMPAS For white defendants: 33% of predictions incorrect For black defendants: 36.2% of predictions incorrect Yet, people claimed that the classi er is unfair towards black people. Can you think of a reason? Discuss with your neighbor for three minutes. 5 fi Criticism of COMPAS “Our analysis of COMPAS found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly agged as low risk.” https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm 6 fl FP + FN Error rate = Confusion Matrix TN + TP + FP + FN Percentage of wrong predictions FP False positive rate = Prediction: Prediction: FP + TN Low Risk High Risk How often were o enders predicted not to reo end? Defendants care about this. Ground-Truth: True Negative False Positive Did not recidivate (TN) (FP) FN False negative rate = FN + TP Ground-Truth: False Negative True Positive Recidivated (FN) (TP) How often were non-o enders predicted to reo end? Judges care about this. 7 ff ff ff ff FP + FN ≈ 34.6 % Confusion Matrix Error rate = TN + TP + FP + FN Entire Population Percentage of wrong predictions FP False positive rate = ≈ 32.4 % Prediction: Prediction: FP + TN Low Risk High Risk How often were o enders predicted not to reo end? Ground-Truth: 2681 1282 Defendants care about this. Did not recidivate (TN) (FP) FN False negative rate = ≈ 37.4 FN + TP Ground-Truth: 1216 2035 Recidivated (FN) (TP) How often were non-o enders predicted to reo end? 8 Judges care about this. ff ff ff ff Confusion Matrix Black vs White Population Black Prediction: Prediction: White Prediction: Prediction: Defendants Low Risk High Risk Defendants Low Risk High Risk Ground-Truth: Ground-Truth: 990 805 1139 349 Did not Did not (TN) (FP) (TN) (FP) recidivate recidivate Ground-Truth: 532 1369 Ground-Truth: 461 505 Recidivated (FN) (TP) Recidivated (FN) (TP) Error rate = 36.2 % Error rate = 33.0 % False positive rate = 44.9 % False positive rate = 23.5 % False negative rate = 28.0 % 9 False negative rate = 47.7 % Fairness De nitions 10 fi Fairness De nition 1: Naïve Attempt Intuition: The classi er assigns label 1 to groups at the same rate Predict label Y ̂ based on features X and group A: ℙ(Y ̂ = 1 ∣ A = 1) = ℙ(Y ̂ = 1 ∣ A = 0) Pro: Both classes get same fraction of 1-labels Con: The assumption is too strong. If the di erent groups have di erent base rates, then even the ground-truth data is unfair. Does not care about errors of the classi er: The de nition is satis ed if we always predict 1 or if we just ip a coin. 11 fi fl fi ff fi fi ff fi Fairness De nition 2: Calibration Intuition: The true outcome is independent of group memberships conditional on score Same likelihood of Y = 1 given features X, group A and score S = S(X): ℙ(Y = 1 ∣ A = 1, S = s) = ℙ(Y = 1 ∣ A = 0, S = s) = s Note that it is Y and not Y ̂ in the equation above (in binary setting think of S = Y)̂ We assume that S can be interpreted as a probability Pro: Prediction depends on underlying risk, not on group membership Con: Distribution of errors can di er among di erent groups (see next slides) 12 fi ff ff COMPAS: Calibration Alexandra Chouldechova: Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 5(2): 153-163 (2017) 13 More info: https://afraenkel.github.io/fairness-book/content/08-compas-2.html Fairness De nition 3: Error Rate Balance Intuition: Equal FPR and FNR across groups Equal FPR (False Positive Rate): ℙ(Y ̂ = 1 ∣ A = 0, Y = 0) = ℙ(Y ̂ = 1 ∣ A = 1, Y = 0) Equal FNR (False Negative Rate): ̂ ̂ ℙ(Y = 0 ∣ A = 0, Y = 1) = ℙ(Y = 0 ∣ A = 1, Y = 1) Pro: Equally wrong on both groups Con: Cannot be combined with calibration (in a few slides) 14 fi COMPAS: Error Rate Balance Alexandra Chouldechova: Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. 15 Big Data 5(2): 153-163 (2017) Trade-O s Between Fairness De nitions Possible Fairness De nition 4: Satisfy all three previous fairness de nitions simultaneously. That would be great, right? Bad news: This is not possible. Except in very special cases (perfect prediction or equal base rates), there is no method that satis es all three conditions simultaneously. Base rate = percentage of population with label 1 Jon M. Kleinberg, Sendhil Mullainathan, Manish Raghavan: Inherent Trade-O 16 s in the Fair Determination of Risk Scores. ITCS 2017: 43:1-43:23 ff fi fi fi ff fi Implications of Impossibility Results Suppose your job is to develop a fair classi er. What is your conclusion from the previous slide about impossibility results? (Discuss with your neighbor for two minutes.) Not one fairness de nition that works in all settings There must be some sort of policy decision Should involve people from other domains 17 fi fi Fairness Through Unawareness When developing a classi er, can’t we just ignore protected attributes? Protected attributes = race, gender, religion, nationality, disability status, … No! Data often contains correlations that reveal protected attributes Examples: Somebody’s ZIP code is 1100, shops in Halal supermarkets and higher expenses around Ramadan times: that person is probably a Muslim. Somebody’s spending a lot of money in pharmacies and books cruises and receives money from a pension fund: that person is probably a pensioner. Even if we remove the protected attributes, ML algorithms nd such correlations 18 fi fi Fairness “De nition” 4: Individual Fairness So far, we considered so-called “group fairness”. We wanted to protect groups. Individual fairness: Intuition: People with similar attributes should be treated similarly Formally: De ne distance measure d(x, x′) between feature vectors. Enforce that the distance of risk scores S(x) and S(x′) is proportional to d(x, x′). Pro: Can take into account heterogeneity within groups and applicable when the protected groups are unknown. Con: De ning appropriate distance measures is di cult.    Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, Richard 19 S. Zemel: Fairness through awareness. ITCS 2012: 214-226 fi fi fi ffi Some Words About Bias Dataset Bias “Learning” from a dataset will “learn” the bias in the data If the dataset has bias against certain groups, ML algorithms will reproduce this bias For instance, there are examples when the analysis of loans handed out by banks were shown to be biased against certain groups Classi ers must be built to avoid reproducing this bias There are cases in which models are more biased than training data Reducing bias in the dataset is not enough! 20 fi Some Words About Bias Not All Errors Are Equal In many cases, the cost FPs and FNs can be quite di erent. Prediction: In many cases FNs are more costly than FPs: Prediction: Has No Has Disease When checking for diseases Disease When identifying fraud Ground-Truth: True Negative False Positive Hiring at big companies like Google Has no (TN) (FP) disease Many classi ers allow you to set di erent cost for FPs and FNs Ground-Truth: False Negative True Positive Does not achieve optimal error rate, Has disease (FN) (TP) but may be more suitable in the application domain Certain bias is not bad per se, but we must be aware 21 ff fi ff Transparency of Classi ers Model cards: Standard list of questions for releasing trained classi ers Used by Google, OpenAI, supported by Hugging Face Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, Timnit Gebru: Model Cards for Model Reporting. FAT 2019: 220-229 22 fi fi Humans and Risk Scores 23 Humans and Risk Scores People are increasingly aware that ML systems can have hidden biases Di cult to overcome from a purely technical perspective Typical trick: Do not deploy system as is, just provide its output as information “We only give a risk score to a decision maker, but in the end the human makes the decision” Sounds very appealing This is not so easy! Can often have unintended side e ects 24 ffi ff KPRA Example KPRA: Kentucky Pretrial Risk Assessment Policy change HB463 in Kentucky in June 2011: Applied to bail (“Kaution”) decisions — whether people are released from jail early or not Two possible choices: They have to pay money ( nancial bond) or not In case of nancial bonds: 20% people get released immediately, otherwise 96% Risk of o enders was automatically calculated as low / moderate / high Taking this rating into account was optional For low or medium risk o enders, judges had to set a non- nancial bond (less harsh) unless they give a reason for deviating from this default 25Law, Economics, and Business Fellows’ Discussion Paper Series 85 (2019): 2019-1. Albright, Alex. "If you give a judge a risk score: evidence from Kentucky bail decisions." ff fi ff fi fi KPRAS: Entire Population 26Law, Economics, and Business Fellows’ Discussion Paper Series 85 (2019): 2019-1. Albright, Alex. "If you give a judge a risk score: evidence from Kentucky bail decisions." KPRAS: Black vs White Populations 27Law, Economics, and Business Fellows’ Discussion Paper Series 85 (2019): 2019-1. Albright, Alex. "If you give a judge a risk score: evidence from Kentucky bail decisions." KAPR: Interpretation Black moderate risk defendants are treated more harshly than similar white moderate risk defendants after but not before HB463 Judges are 10% more likely to deviate from the non- nancial bond recommendation for moderate black rather than similar moderate white defendants This is suggestive evidence that judges interpret risk score levels di erently based on defendant race Demonstrates how interactions between human discretion and prediction tool recommendations can mean unequal policy e ects across racial groups 28Law, Economics, and Business Fellows’ Discussion Paper Series 85 (2019): 2019-1. Albright, Alex. "If you give a judge a risk score: evidence from Kentucky bail decisions." ff fi ff KAPR: No Black-Box System 29Law, Economics, and Business Fellows’ Discussion Paper Series 85 (2019): 2019-1. Albright, Alex. "If you give a judge a risk score: evidence from Kentucky bail decisions." Takeaways ML algorithms have impact on real people We cannot just ignore that there are protected classes and attributes Ensuring fairness must be taken into account when building the classi er When building classi ers, look at the confusion matrix, think about the cost/impact of FNs and FPs in your application domain There is not a single fairness de nition that always applies Bias is not bad per se Even if risk scores are just given as advice, this can have unexpected consequences 30 fi fi fi Acknowledgements Based on slides by: Irene Y. Chen (link) Justin Johnson and David Fouhey (link) Interesting further information: “Tutorial: 21 fairness de nitions and their politics” by Arvind Narayanan available on YouTube (link) For the COMPAS datasets, some of the references contain links to Jupyter notebook where you can explore the data yourself. There are also more references with concrete code online if you google it. 31 fi Machine Learning Support Vector Machines and other Kernel Methods Univ.-Prof. Dr. Thomas Gärtner January 7, 2025 Machine Learning Research Unit TU Wien Informatics SUPPORT VECTOR MACHINES & OTHER KERNEL METHODS THOMAS GÄRTNER Recall: linear learning algorithms Perceptron Perceptron update rule 𝑥1 𝑓 𝑥 = 𝜎 𝑥, 𝑤 𝑤 (𝑡+1) 𝑢 𝑤 𝑥2 𝑤1 𝑤 (𝑡) 𝑤2 𝑑 𝑥3 𝑤3 𝜎 𝑓 𝑥 = sign 𝑥𝑖 𝑤𝑖 𝑤4 𝑥4 𝑤5 𝑖=1 𝑥5 Random initialisation (arrow depicts wt/∥wt ∥) Perceptron Example First mistake (arrow depicts wt/∥wt ∥) First update (arrow depicts wt/∥wt ∥) Some correctly classified instances and 2nd mistake (arrow depicts wt/∥wt ∥) Second update (arrow depicts wt/∥wt ∥) Some correctly classified examples and 3rd mistake (arrow depicts wt/∥wt ∥) Third update (arrow depicts wt/∥wt ∥) Some correctly classified examples and 4th mistake (arrow depicts wt/∥wt ∥) Fourth update (arrow depicts wt/∥wt ∥) Some correctly classified examples and 5th mistake (arrow depicts wt/∥wt ∥) Fifth update (arrow depicts wt/∥wt ∥) Perceptron in hypothesis space 𝑤 (𝑡+1) X X 𝑢 X 𝑤 (𝑡) Perceptron: Version Space Geometry Perceptron in hypothesis space Perceptron in hypothesis space 𝑤 (𝑡+1) X X 𝑤 (𝑡+1) X X X X 𝑢 𝑤 (𝑡) 𝑢 𝑤 (𝑡) + + + Perceptron in hypothesis space Perceptron in hypothesis space 𝑤 (𝑡+1) X X 𝑤 (𝑡+1) X X X X 𝑢 𝑤 (𝑡) + 𝑢 𝑤 (𝑡) + + + - - Simple Boolean Data (False ↦ −1; True ↦ +1) x (F,T) x (T,T) x (F,F) x (T,F) Perceptron: Expressivity of Boolean Functions AND OR - + - + - - - - + + - + NOT XOR - + + - - + + - ? - + - - + + - - + + + + - - + - + + - - - + + - - + + - XOR Summary: Perceptron (𝑥 (𝑡) ∈ ℝ𝑑 , 𝑦𝑡 ∈ −1, +1 for 𝑡 = 1, 2, …) traditionally used in online learning - + + - 𝑓𝑡 𝑥 = 𝜎 𝑥 (𝑡) , 𝑤 (𝑡) expressivity thresholded linear hypotheses (halfspaces) 0 if 𝑓𝑡 𝑥 (𝑡) = 𝑦𝑡 simple Boolean functions (and, or, … ) 𝑤 (𝑡+1) = 𝑤 (𝑡) + ൝ 𝑦𝑡 𝑥 (𝑡) otherwise ? - + learning theoretic bounds - - + + data dimension based 𝜎: sign function radius-margin bound therefore, onproduct ⋅,⋅ : inner mistake 𝑥 (𝑡) , 𝑤 (𝑡+1) = 𝑥, 𝑤 (𝑡) + 𝑦𝑡 𝑥 (𝑡) downsides linear hypotheses not very expressive + - + + - - no xor/parity type data 𝑤 (𝑡+1) X X X fix kernel perceptron 𝑢 𝑤 (𝑡) + + NO WAY!!! multi-layer perceptron - + + - - Rosenblatt, Frank (1957) The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory Perceptron halfspace 𝑥1 𝑥2 𝑤 Dealing with Non-Linearities: Layering up 𝑥3 𝜎 𝑥4 −𝑤 𝑥5 Layering up Layering up further halfspaces halfspaces conjunction intersection 𝑥1 𝑥1 −𝑤2 𝑥2 𝜎 −𝑤2 𝑥2 𝜎 𝑥3 𝜎 𝑥3 𝜎 𝜎 (apx any) −𝑤3 convex body 𝑥4 𝜎 𝑥4 𝜎 −𝑤1 𝑥5 𝑥5 −𝑤3 −𝑤1 Layering up even further Layering up even further 𝜎 𝜎 disjunction 𝜎 𝜎 disjunction union union 𝑥1 𝜎 𝜎 𝑥1 𝜎 𝜎 𝑥2 𝜎 𝜎 𝑥2 𝜎 𝜎 𝑥3 𝜎 𝜎 𝜎 𝑥3 𝜎 𝜎 𝜎 𝑥4 𝜎 𝜎 𝑥4 𝜎 𝜎 𝑥5 𝜎 𝜎 (apx any) set 𝑥5 𝜎 𝜎 (apx any) set 𝜎 𝜎 𝜎 𝜎 halfspace conjunction halfspace conjunction intersection intersection Also: Any Booolean Function exactly (cf DNF/CNF) ! Non-linear separation: a toy data set becomes linear separable Dealing with Non-Linearities: Mapping up Looking into higher dimensional space Introduction to Kernel Functions Non-linear Perceptrons Non-linear Perceptrons 𝑤 (𝑡+1) 𝑤 (𝑡+1) 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑤 (𝑡) 𝑤 (𝑡) 𝑓 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 by “swapping sums” 𝑖 𝑖 Non-linear Perceptrons Kernel Perceptrons 𝑤 (𝑡+1) 𝑤 (𝑡+1) 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑤 (𝑡) 𝑤 (𝑡) 𝑓 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 𝑓 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 𝑖 𝑖 𝑖 𝑖 we don’t need to be in Euclidean space anymore, 𝑓𝜙 𝑥 = 𝑦𝑖 𝜙(𝑥𝑖 ), 𝜙(𝑥) 𝑓𝜙 𝑥 = 𝑦𝑖 𝜙(𝑥𝑖 ), 𝜙(𝑥) we only need a valid (bi-)linear inner product 𝑖 𝑖 kernels 𝑘: 𝒳 × 𝒳 → ℝ are exactly those functions for which we know that such a feature map 𝜙: 𝒳 → ℋ exists, where ℋ is a Hilbert space and ∀𝑥, 𝑥 ′ ∈ 𝒳: 𝑘 𝑥, 𝑥 ′ = 𝜙 𝑥′ , 𝜙(𝑥) Kernel Perceptrons Kernel Perceptrons 𝑤 (𝑡+1) 𝑤 (𝑡+1) 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑓 𝑥 = 𝑤, 𝑥 where 𝑤 = σ𝑖 𝑦𝑖 𝑥𝑖 and 𝑦𝑖 ∈ ±1 𝑢 𝑤 (𝑡) 𝑤 (𝑡) 𝑓 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 𝑓 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 = 𝑦𝑖 𝑥𝑖 , 𝑥 𝑖 𝑖 𝑖 𝑖 𝑓𝜙 𝑥 = 𝑦𝑖 𝜙(𝑥𝑖 ), 𝜙(𝑥) 𝑓𝑘 𝑥 = 𝛼𝑖 𝑘(𝑥𝑖 , 𝑥) where 𝛼𝑖 ∈ ℝ 𝑓𝜙 𝑥 = 𝑦𝑖 𝜙(𝑥𝑖 ), 𝜙(𝑥) 𝑓𝑘 𝑥 = 𝛼𝑖 𝑘(𝑥𝑖 , 𝑥) where 𝛼𝑖 ∈ ℝ 𝑖 𝑖 𝑖 𝑖 kernels 𝑘: 𝒳 × 𝒳 → ℝ are exactly those functions for which we know that such a kernels 𝑘: 𝒳 × 𝒳 → ℝ are exactly those functions for which we know that such a feature map 𝜙: 𝒳 → ℋ exists, where ℋ is a Hilbert space and ∀𝑥, 𝑥 ′ ∈ 𝒳: 𝑘 𝑥, 𝑥 ′ = 𝜙 𝑥′ , 𝜙(𝑥) feature map 𝜙: 𝒳 → ℋ exists, where ℋ is a Hilbert space and ∀𝑥, 𝑥 ′ ∈ 𝒳: 𝑘 𝑥, 𝑥 ′ = 𝜙 𝑥′ , 𝜙(𝑥) Our first kernel function Our first kernel function Take 𝑘 𝑎, 𝑏 = 𝑎, 𝑏 2 2 𝑎1 𝑏 𝑘 𝑎, 𝑏 ′ = 𝑎 , 1 2 𝑏2 … 𝑘 𝑎, 𝑏′ = 𝑎1 𝑏1 + 𝑎2 𝑏2 2 … 𝑎1 𝑎1 we want: 𝜙 = 𝑎1 𝑎2 we want: 𝜙 = 𝑎1 𝑎2 𝑎2 𝑎2 … 𝑘 𝑎, 𝑏′ = 𝑎1 𝑏1 2 + 2𝑎1 𝑏1 𝑎2 𝑏2 + 𝑎2 𝑏2 2 … 𝑘 𝑎, 𝑏′ = 𝑎12 𝑏12 + 2𝑎1 𝑎2 𝑏1 𝑏2 + 𝑎12 𝑏22 𝑎12 𝑏12 = 2𝑎1 𝑎2 , 2𝑏1 𝑏2 = 𝜙 𝑎 ,𝜙 𝑏 𝑎12 𝑏22 Kernel matrix equivalences and kernel function closure properties Kernel matrix equivalences and kernel function closure properties For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and 1. 𝐾 is positive semi-definite (PSD) 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with 1. 𝐾 is positive semi-definite (PSD) 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. Added benefit: Added benefit: Kernels define valid distance metrics: 𝑑 𝑥, 𝑥 ′ = 𝜙 𝑥 − 𝜙(𝑥 ′ ); 𝜙(𝑥) − 𝜙(𝑥′) Kernels define valid distance metrics: 𝑑 𝑥, 𝑥 ′ = 𝜙 𝑥 − 𝜙(𝑥 ′ ); 𝜙(𝑥) − 𝜙(𝑥′) Kernels define valid distance metrics: 𝑑 𝑥, 𝑥 ′ = 𝑘 𝑥, 𝑥 − 2𝑘 𝑥, 𝑥 ′ + 𝑘(𝑥 ′ , 𝑥′) Kernels define valid distance metrics: 𝑑 𝑥, 𝑥 ′ = 𝑘 𝑥, 𝑥 − 2𝑘 𝑥, 𝑥 ′ + 𝑘(𝑥 ′ , 𝑥′) We can apply any distance based learning algorithm (efficiently) in feature space! We can apply any distance based learning algorithm (efficiently) in feature space! Why are kernels useful in learning algorithms? Why are kernels useful in learning algorithms? Functions with everywhere positive semi-definite Hessian (second derivative) are convex. Functions with everywhere positive semi-definite Hessian (second derivative) are convex. Convex functions are (relatively) easy to minimise. Convex functions are (relatively) easy to minimise. Reminder: Convex Functions & Example: Quadratic Functions Examples of common kernel function convex 𝑥 ∈ ℝ1 polynomials 𝑎 𝑥 2 + 𝑏 𝑥 + 𝑐 are convex for 𝑎 ≥ 0 neither convex nor concave polynomial linear 𝑘 𝑥, 𝑥 ′ = 𝑥, 𝑥 ′ 𝑘 𝑥, 𝑥 ′ = 𝑥, 𝑥 ′ + 𝑙 𝑑 What about concave 𝑥 ∈ ℝ2 polynomials 𝑥 𝑡 𝐴𝑥 + 𝑏 𝑡 𝑥 + 𝑐 ? for 𝑥2 constant: convex in 𝑥1 Consider 𝑥12 + 𝑥22 − 4 𝑥1 𝑥2 for 𝑥1 constant: convex in 𝑥2 BUT for 𝑥1 = 𝑥2 = 𝑥 we get −2𝑥 2 : not convex Gaussian ~sigmoid 𝑘 𝑥, 𝑥 ′ = exp −𝛾 𝑥 − 𝑥 ′ 2 𝑥 ∈ ℝ𝑑 polynomials 𝑥 𝑡 𝐴𝑥 + 𝑏 𝑡 𝑥 + 𝑐 are convex for 𝐴 positive semi-definite [Info: Kernel matrix equivalences & kernel function closure properties ] For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with 1. 𝐾 is positive semi-definite (PSD) 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 But... which Linear Hypothesis? For positive semi-definite kernel functions 𝑔, ℎ and 2. 𝐾 can be factorised 𝛾 ≥ 0, the following functions k are also positive ∃ℓ ∈ ℕ, 𝐹 ∈ ℝℓ×𝑛 : 𝐾= 𝐹𝑡 𝐹 semi-definite kernel functions: 𝑘 𝑥, 𝑥 ′ = 𝛾𝑔 𝑥, 𝑥 ′ 𝑘 𝑥, 𝑥 ′ = 𝑔 𝑥, 𝑥 ′ + ℎ 𝑥, 𝑥 ′ 3. 𝐾 has only non-negative Eigenvalues 𝑘 𝑥, 𝑥 ′ = 𝑔 𝑥, 𝑥 ′ ∗ ℎ 𝑥, 𝑥 ′ K = 𝑈𝐷𝑈 𝑡 , 𝑈 𝑡 𝑈 = 𝑰, ∀𝑖 𝐷𝑖𝑖 ≥ 0, ∀𝑗≠𝑖 𝐷𝑖𝑗 = 0 𝑘 𝑥, 𝑧 , 𝑥 ′ , 𝑧′ = 𝑔 𝑥, 𝑥 ′ ∗ ℎ 𝑧, 𝑦 ′ A simple toy data set Data set with independently bounded noise per feature Data set with norm-bounded noise Data set with increasing, norm-bounded noise Maximally non-overlapping, norm-bounded noise and separating line Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) x ∈ R2 | ⟨x, w ⟩ = 0 w w Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) x ∈ R2 | ⟨x, w ⟩ = 0 x ∈ R2 | ⟨x, w ⟩ = 0 w w γ γ n D E o w x ∈ R2 | x, ∥w ∥ =γ n D E o w x ∈ R2 | x, ∥w ∥ = −γ Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) D E D E w w x ∈ R2 | ⟨x, w ⟩ = 0 halfspace Pw ,γ = {x ∈ R2 | x, ∥w ∥ ≥ γ} x ∈ R2 | ⟨x, w ⟩ = 0 halfspace Pw ,γ = {x ∈ R2 | x, ∥w ∥ ≥ γ} w w data D = {(xi , yi )}ni=1 ⊂ R2 × {±1} γ γ D E n D E o D E n D E o w w w w halfspace Nw ,γ = {x ∈ R2 | x, ∥w ∥ ≤ −γ} x ∈ R2 | x, ∥w ∥ =γ halfspace Nw ,γ = {x ∈ R2 | x, ∥w ∥ ≤ −γ} x ∈ R2 | x, ∥w ∥ =γ n D E o n D E o w w x ∈ R2 | x, ∥w ∥ = −γ x ∈ R2 | x, ∥w ∥ = −γ Supporting vectors ·, separating line —, and margin lines - - - (hyperplanes) D E w x ∈ R2 | ⟨x, w ⟩ = 0 halfspace Pw ,γ = {x ∈ R2 | x, ∥w ∥ ≥ γ} data D = {(xi , yi )}ni=1 ⊂ R2 × {±1} w Version Space Geometry (hard-margin SVM) γ D E n D E o w w halfspace Nw ,γ = {x ∈ R2 | x, ∥w ∥ ≤ −γ} x ∈ R2 | x, ∥w ∥ =γ n D E o w x ∈ R2 | x, ∥w ∥ = −γ enforce ∀i : yi = −1 ⇒ xi ∈ Nw ,γ ∧ yi = +1 ⇒ xi ∈ Pw ,γ Version Space Geometry of SVMs Version Space Geometry of SVMs SVM solution „Bayes point“ x x Brief Learning Theory for Kernel Methods Goal: Low test error Problem: Can‘t minimise test error directly Theorem: With high probability, it holds for all functions 𝑓 ∈ ℋ that test training A tiny bit of Theory and Algorithms error (𝑓) ≤ error (𝑓) + capacity (ℋ) + constants Idea 1: Find a function which minimises the right hand side (rhs) of above equation Problem: Training error is often a non-convex function Idea 2: Find a function which minimises a convex upper bound of the rhs max 0, 1−... 𝜎(−... ) 𝑦𝑓(𝑥) Support Vector Machines Support Vector Machines 1 2 1 2 arg min max 0,1 − 𝑦𝑖 𝑓 𝑥𝑖 +𝜈 𝑓 arg min max 0,1 − 𝑦𝑖 𝑓 𝑥𝑖 +𝜈 𝑓 𝑓∈ℋ 𝑛 𝑓∈ℋ 𝑛 𝑖 𝑖 Representer Theorem: 𝑓 𝑥 = 𝛼𝑗 𝑘 𝑥𝑗 , 𝑥 𝑗 Slack Variables 𝜉𝑖 ≥ max 0, 1 − 𝑦𝑖 𝑓(𝑥𝑖 ) 1 𝑡 arg min𝑛 𝟏 𝜉 + 𝜈𝛼 𝑡 𝐾𝛼 𝛼,𝜉∈ℝ 𝑛 s.t. ∀𝑖: 𝜉𝑖 ≥ 1 − 𝑦𝑖 𝑓(𝑥𝑖 ) s.t. ∀𝑖: 𝜉𝑖 ≥ 0 Neural Networks vs Kernel Methods Non-convex optimisation Convex optimization Non-deterministic results Deterministic polynomial time Restart optimisation many times Run once (S)GD & variants L_BFGS_B, QP, SMO, (S)GD, … Different data types Different data types Example Applications -> different kernel functions -> different architectures Defined non-linearities Learned non-linearities Multiple kernel learning Foundation in Learning Theory Kernels for Basic Terms Structure Elucidation valid! principles type Spectrum = Frequency -> Multiplicity represent individuals by typed terms efficient! type Frequency = Real with modifier Gaussian 0.6 represent sets, multisets, etc. naturally data Multiplicity = s | d | t | q | 0 default 0 logic for learning (Lloyd, 2003) effective? higher-order logic with polymorphic types task the terms of the logic are the terms of the typed -calculus, given the 13C NMR spectrum of Diterpenes, predict the skeleton class it belongs to basic terms are essentially unambiguous closed terms of the logic data basic abstractions are used to represent sets, multisets, etc 1503 spectra classified into 23 skeleton types Spatial & Socio-Demographic Clustering Graph Kernels [colt’03, mlg’03, kdd‘04] type Neighbourhood = Coords -> Statistics type Coords = (Real, Real) with modifier gaussian 0.05 common approach: 0 1 0 0 2 0 0 0 1 0 0 2 0 1 0 0 0 type Statistics = Real**d with modifier normalised 1 2 define kernel as some function of the 2 0 common subgraphs (cf strings, trees, …) 1 1 1 2 0 0 0 0 0 0 2 2 2 1 2 1 2 2 2 1 1 2 1 computing this graph kernel is NP-hard; 0 2 2 0 1 even for paths, cycles, or connected subgraphs (with subgraph-isomorphism as the embedding) computing complete graph kernels is at least as hard as deciding graph isomorphism × direct for efficient computation consider particular graphs-classes different embedding operators, e.g., homomorphism = product graph a lot has happened since! (cf Kriege et al. A Survey on Graph Kernels) Cyclic Discovery Processes – algorithmic challenges Active Search in Intensionally Specified Structured Spaces [aaai’17] design novel build a hypothesis of candidate structures the properties of all pharm-space design analyse structures from the weighted maximum a posteriori huge, intensionally design space specified design cope with the cyclic nature. sample space of structures to improve designs Metropolis-Hastings. cope with the huge candidate space.. sample aim at general-purpose algorithms. interface with domain experts. realise designed make determine properties candidate structures test of realised structures until converged in their context in their context 58 speed. Designing Drugs: Simulations [aaai’17] & Idiopathic Pulmonary Fibrosis [mi’18] non-Hamiltonian graphs greedy (baseline) [ the end -- thank you! ] active search discovered 19 out of the 24 active compounds which are known to be active from previous biological assays! Kernel matrix equivalences and kernel function closure properties For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with 1. 𝐾 is positive semi-definite (PSD) 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 [ Technical Appendix: Kernel Functions ] 2. 𝐾 can be factorised ∃ℓ ∈ ℕ, 𝐹 ∈ ℝℓ×𝑛 : 𝐾 = 𝐹 𝑡 𝐹 3. 𝐾 has only non-negative Eigenvalues K = 𝑈𝐷𝑈 𝑡 , 𝑈 𝑡 𝑈 = 𝑰, ∀𝑖 𝐷𝑖𝑖 ≥ 0, ∀𝑗≠𝑖 𝐷𝑖𝑗 = 0 Kernel matrix equivalences and kernel function closure properties Kernel matrix equivalences and kernel function closure properties For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with 𝑥1 , 𝑥2 , … 𝑥𝑛 ∈ 𝒳 the matrix 𝐾 ∈ ℝ𝑛×𝑛 defined with 1. 𝐾 is positive semi-definite (PSD) 1. 𝐾 is positive semi-definite (PSD) 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. 𝐾𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 is positive semi-definite. ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 ∀𝑐 ∈ ℝ𝑛 : 𝑐 𝑡 𝐾𝑐 ≥ 0 2 𝑐 𝑡 𝐾𝑐 = 𝑐 𝑡 𝐹 𝑡 𝐹𝑐 = 𝐹𝑐 𝑖 = 𝐹𝑐 2 ≥0 2. 𝐾 can be factorised 𝑖 2. 𝐾 can be factorised ∃ℓ ∈ ℕ, 𝐹 ∈ ℝℓ×𝑛 : 𝐾= 𝐹𝑡 𝐹 ∃ℓ ∈ ℕ, 𝐹 ∈ ℝℓ×𝑛 : 𝐾 = 𝐹 𝑡 𝐹 3. 𝐾 has only non-negative Eigenvalues 3. 𝐾 has only non-negative Eigenvalues K = 𝑈𝐷𝑈 𝑡 , 𝑈 𝑡 𝑈 = 𝑰, ∀𝑖 𝐷𝑖𝑖 ≥ 0, ∀𝑗≠𝑖 𝐷𝑖𝑗 = 0 K = 𝑈𝐷𝑈 𝑡 , 𝑈 𝑡 𝑈 = 𝑰, ∀𝑖 𝐷𝑖𝑖 ≥ 0, ∀𝑗≠𝑖 𝐷𝑖𝑗 = 0 Kernel matrix equivalences and kernel function closure properties Kernel matrix equivalences and kernel function closure properties For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive For any symmetric 𝐾 ∈ ℝ𝑛×𝑛 , the following are A symmetric function 𝑘: 𝒳 × 𝒳 → ℝ is positive equivalent: semi-definite (PSD) if and only if for all 𝑛 ∈ ℕ and equivalent: semi